Somewhere along the line, Netflix became the bellwether for cloud development, as its immensely successful video streaming service has put it in the spotlight, both for its use of scalable public cloud infrastructure and for its close partnership with Amazon Web Services to deliver it.
And now, in the wake of last weekend’s great Amazon Web Services outage, Netflix has reaffirmed its commitment to the provider and given an overview of the lessons learned.
The outage, as you likely know by now, was caused last Friday evening by an electrical storm that hit an AWS Availability Zone (AZ) in the US East (Northern Virginia) region. For Netflix, even though AWS came back up 20 minutes after the initial failure, an API backlog and capacity problems kept the vital AWS Elastic Load Balancing (ELB) service from returning to its normal performance baseline.
In practical terms, this meant Netflix customers found themselves attempting to connect to unhealthy instances that weren't properly deregistering with ELB, effectively turning the unavailable zone into a "black hole": network calls into it were simply left hanging.
As the Netflix blog entry put it:
In our middle tier load-balancing, we had a cascading failure that was caused by a feature we had implemented to account for other types of failures. The service that keeps track of the state of the world has a fail-safe mode where it will not remove unhealthy instances in the event that a significant portion appears to fail simultaneously. This was done to deal with network partition events and was intended to be a short-term freeze until someone could investigate the large-scale issue. Unfortunately, getting out of this state proved both cumbersome and time consuming, causing services to continue to try and use servers that were no longer alive due to the power outage.
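The fail-safe described in that quote can be pictured with a short sketch. This is not Netflix's actual code; it is an illustrative model, with hypothetical names and a made-up threshold, of a registry that freezes membership changes when too many instances appear to fail at once:

```python
# Illustrative sketch of the fail-safe described above: the registry
# refuses to deregister unhealthy instances when a large fraction fail
# simultaneously, on the assumption that a network partition (not a
# real mass failure) is the likelier cause. All names are hypothetical.

FAILSAFE_THRESHOLD = 0.5  # freeze if over half the fleet looks unhealthy

class InstanceRegistry:
    def __init__(self, instances):
        # instance id -> healthy flag
        self.instances = {i: True for i in instances}

    def report_health(self, checks):
        """checks: dict of instance id -> bool from the latest probe."""
        unhealthy = [i for i, ok in checks.items() if not ok]
        if len(unhealthy) / len(self.instances) > FAILSAFE_THRESHOLD:
            # Too many simultaneous failures: assume a partition and
            # freeze membership rather than drop most of the fleet.
            # This is the state that proved cumbersome to exit when the
            # instances really were dead from the power outage.
            return "frozen"
        for i in unhealthy:
            self.instances[i] = False
        return "updated"

    def healthy_instances(self):
        return [i for i, ok in self.instances.items() if ok]
```

The trade-off the quote describes falls out directly: the freeze protects against partitions, but when a power outage genuinely kills a large share of instances, traffic keeps flowing to servers that no longer exist.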
Fortunately for those Netflix users, some of the company's much-touted open source and other internal cloud resiliency projects paid off. Regional isolation of the issue meant that European customers were left unaffected. And Cassandra, Netflix's cross-region distributed cloud persistence store, survived the loss of a third of its regional nodes with no loss in availability.
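Why losing a third of the nodes need not hurt availability comes down to quorum replication. As a hedged worked example (assuming quorum reads and writes over a replication factor of 3, a common Cassandra configuration; the function name here is illustrative):

```python
# With replication factor 3, a quorum operation needs a majority of
# replicas (2 of 3) to respond. Losing one replica out of three still
# leaves a quorum, so reads and writes keep succeeding.

REPLICATION_FACTOR = 3
QUORUM = REPLICATION_FACTOR // 2 + 1  # 2 of 3 replicas must respond

def can_serve(live_replicas):
    """Can a quorum operation complete with this many live replicas?"""
    return live_replicas >= QUORUM

assert can_serve(3)       # all replicas up
assert can_serve(2)       # one of three lost: quorum still reachable
assert not can_serve(1)   # two of three lost: unavailable
```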
Chaos Gorilla, one of Netflix’s roster of Simian Army cloud availability loss simulation tools (which also includes the recently open-sourced Chaos Monkey), is designed to simulate the failure of an AWS availability zone, and if nothing else, this incident is helping Netflix refine its approach.
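Conceptually, a Chaos Gorilla-style test is simple: take out an entire zone and check that the service still has capacity elsewhere. The sketch below is not Netflix's tooling, just an illustration of the idea with a hypothetical fleet layout:

```python
import random

# Illustrative Chaos Gorilla-style exercise: remove every instance in
# one randomly chosen availability zone, then verify the service still
# has capacity in the surviving zones. Fleet layout is made up.

def simulate_zone_failure(fleet):
    """fleet: dict of zone name -> list of instance ids.

    Returns the failed zone and the surviving portion of the fleet.
    """
    doomed_zone = random.choice(list(fleet))
    survivors = {z: ids for z, ids in fleet.items() if z != doomed_zone}
    return doomed_zone, survivors

fleet = {
    "us-east-1a": ["i-1", "i-2"],
    "us-east-1b": ["i-3", "i-4"],
    "us-east-1c": ["i-5", "i-6"],
}
zone, remaining = simulate_zone_failure(fleet)
# The resiliency check: capacity must remain outside the failed zone.
assert any(remaining.values()), "no capacity left after zone failure"
```

Running such an exercise regularly, rather than waiting for a storm to run it for you, is the point of the Simian Army approach.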
The final word: Netflix says that the cloud is rapidly maturing, and that it's working with Amazon Web Services to help prevent these kinds of failures from happening again. What's more, despite any Amazon outages, Netflix says that it's still seen a general rise in availability since it moved from its data center to Amazon's public cloud.
So that’s a strong vote for public cloud from Netflix. But despite the company’s prominence, many are still wary of cloud availability disruptions, and these outages only make them more so. It’s just a matter of learning from mistakes, refining your approach and moving forward.