INFRA
When Amazon Web Services Inc. suffered a region-wide outage last week that disrupted services across the U.S. East Coast, Snowflake Inc. saw an opportunity to remind customers that disruptions don’t have to be disasters.
Snowflake said more than 300 critical workloads using its Snowgrid feature were able to maintain operations with only minimal interruption by failing over to alternate cloud regions.
“There were a number of folks in the middle of an outage trying to find out what was happening and how they could recover quickly,” said Christian Kleinerman, Snowflake’s chief product officer. “Customers who chose to leverage our business continuity capabilities continued their operations as if nothing had happened. It was a non-event.”
Introduced in 2022, Snowgrid enables organizations to replicate workloads across regions on the three major public clouds and to shift data processing and client connections to alternate sites during a disruption. Failovers are initiated by individual customers based on predefined scenarios. Kleinerman said Snowflake built its service from the ground up to support transactional consistency and low-latency replication.
Snowgrid has three basic components: replication, failover and client redirection. It can be configured to replicate data from one cloud region to another. When a disruption occurs in the primary region, customers can trigger a failover to the designated secondary region to shift processing. Workloads resume where they left off without data loss or duplication. Snowgrid then redirects client applications to the secondary region automatically by updating Domain Name System entries behind the scenes, so most users see only a brief blip before operations resume.
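The three pieces described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Snowflake's API or implementation: the `Region` and `FailoverGroup` classes, their methods and the region names are hypothetical, and the `resolve` method merely stands in for a DNS lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for a cloud region holding a copy of the data."""
    name: str
    rows: list = field(default_factory=list)


class FailoverGroup:
    """Toy model: one primary region, one replica, one stable endpoint."""

    def __init__(self, primary: Region, secondary: Region) -> None:
        self.primary = primary
        self.secondary = secondary

    def write(self, row) -> None:
        # Writes always go to whichever region is currently primary.
        self.primary.rows.append(row)

    def replicate(self) -> None:
        # Copy the primary's state to the secondary as a single snapshot,
        # so related records stay consistent with one another.
        self.secondary.rows = list(self.primary.rows)

    def fail_over(self) -> None:
        # Customer-initiated: swap roles so the replica becomes the primary.
        self.primary, self.secondary = self.secondary, self.primary

    def resolve(self) -> str:
        # Stand-in for DNS: the stable name maps to the current primary.
        return self.primary.name


group = FailoverGroup(Region("us-east"), Region("us-west"))
group.write("order-1")
group.replicate()
group.fail_over()
print(group.resolve())  # us-west
```

Because clients only ever ask `resolve()` where to connect, they pick up the new region after a failover without any reconfiguration, which is the idea behind the redirection step.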
“Anyone who has been doing databases for a while realizes that if your line items don’t match your orders because the point in time didn’t match on both sides, you have frenzy and chaos until you fail over,” Kleinerman said.
Once replication is in place, Snowflake continuously manages the state of data and workloads. If a disruption appears likely to last more than a few minutes, users can trigger failover manually, shifting operations to another region or cloud in less than a minute.
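The "more than a few minutes" rule of thumb can be written down as a simple check that an operations team might use to decide when to trigger failover by hand. This is a hypothetical sketch: the `failover_recommended` function and the three-minute default threshold are assumptions for illustration, not values from Snowflake.

```python
from datetime import datetime, timedelta
from typing import Optional


def failover_recommended(
    first_failure: Optional[datetime],
    now: datetime,
    threshold: timedelta = timedelta(minutes=3),  # assumed value for "a few minutes"
) -> bool:
    """Return True once health checks have been failing for at least `threshold`.

    `first_failure` is the time the current run of failed checks began,
    or None if the primary region is healthy.
    """
    if first_failure is None:
        return False
    return now - first_failure >= threshold


start = datetime(2025, 1, 1, 4, 0)  # arbitrary example timestamp
print(failover_recommended(start, start + timedelta(minutes=1)))  # False
print(failover_recommended(start, start + timedelta(minutes=5)))  # True
```

The function only recommends action; in keeping with the article, the failover itself remains a deliberate step taken by the customer.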
One customer that failed over during the outage was Vermont Information Processing Inc., a software provider for the beverage industry that serves more than 1,200 suppliers and 400 distributors. Director of Applications Chris McGinty said his team noticed problems in AWS’s U.S. East region early on October 20.
“We first heard about it from one of our operations team members, who noticed they couldn’t log into the cloud console,” McGinty said. “Around 4 a.m., we noticed we could no longer access the AWS U.S. East 1 console for internal operations.”
Although Snowflake’s services were initially unaffected, internal monitoring showed signs of degraded performance. “Within a matter of about five minutes, we had all of our workloads running on our secondary U.S. West location,” McGinty said. “Our applications saw no real downtime.”
McGinty said Snowgrid isn’t a “set it and forget it” proposition but requires forethought. “We were confident that what we had to do was going to work, and it did,” he said. “We did a lot of testing. It’s great to have plans and infrastructure in place, but you have to test.”
Kleinerman said Snowgrid’s client redirection feature is key. By automatically updating DNS entries, customer tools like business intelligence dashboards can reconnect without user intervention. “They’ll see a blip for a minute or so, and then everything continues to work as if nothing had happened,” he said.
Snowgrid is a paid, optional feature that only about one-quarter of Snowflake customers use. Kleinerman said many companies are confused about the effectiveness of AWS availability zones, which are isolated data center locations within a region that can provide a level of fault tolerance. Last week’s disruption demonstrated that, in the event of a failure affecting shared services like identity or DNS, availability zones may not be sufficient protection.
The AWS incident was triggered by a failure in the U.S. East region’s DNS, affecting control plane services and leaving many users without access even if their workloads were technically running in alternate availability zones. “Availability zones don’t mean business continuity,” Kleinerman said.
Vermont Information Processing’s McGinty said his organization has no plans to change its setup in response to the outage. “I think we’ve made the right investment,” he said. “It seems like everything’s in place for us to be able to operate successfully when these types of major incidents happen.”
While Snowflake was clearly eager to promote Snowgrid in the wake of the outage, the broader lesson is less about a vendor than about architecting for resilience, Kleinerman said.
“Outages will happen,” he said. “If you prepare, you’re going to have an uneventful day. If you don’t, you’re going to have a busy Monday.”