What We Can Learn from What Netflix Learned from the AWS Failure


Late in June (last month), Amazon Web Services suffered a catastrophic series of failures, now often referred to as an "AWS Storm," and Netflix found itself caught in the middle of it, as its own servers were intimately linked to Amazon's. An actual storm caused the AWS storm, of course, but the effects on networked services were much like having lightning strike your movie streaming. Learning from the failure of underlying cloud architecture, and discovering how to make a network more resilient for customers, is a big deal, so Netflix went to work.

Before this happened, Netflix had realized that the way to prepare for failures was to simulate them and watch how the entire system handled them. So the company came up with "Chaos Gorilla," part of its Simian Army, a suite of services that deliberately generates unexpected failures in the resilient system in small doses to keep operations teams on their toes and aware.

What did Netflix learn from the failure we can use?

“Our own root-cause analysis uncovered some interesting findings, including an edge-case in our internal mid-tier load-balancing service,” wrote Netflix in an operations blog on the subject. “This caused unhealthy instances to fail to deregister from the load-balancer which black-holed a large amount of traffic into the unavailable zone. In addition, the network calls to the instances in the unavailable zone were hanging, rather than returning no route to host.”
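The last point in that quote, network calls hanging rather than returning "no route to host," is the classic argument for failing fast. A minimal sketch of the idea (not Netflix's actual code, just a hypothetical health probe) is to put an explicit timeout on every connection attempt, so a call into an unreachable zone returns an error quickly instead of tying up the caller:

```python
import socket

# Hypothetical illustration: probe an instance with an explicit timeout,
# so a call into an unreachable zone fails fast instead of hanging.
def check_instance(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the instance accepts a TCP connection within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals, and "no route to host"
        return False
```

A load balancer built on a probe like this gets a definitive "unhealthy" answer within the timeout window, which is exactly the signal needed to deregister a dead instance rather than black-hole traffic into it.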

These revelations led to some thoughts on what both AWS and Netflix could do to better prepare their services for the unexpected, starting with enhancing middle-tier load balancing to handle shifting load without failing, and preventing gridlock when clients that suddenly lose their connections all try to reconnect at once and find themselves in a traffic jam.
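The standard remedy for that reconnect traffic jam is exponential backoff with jitter. Here is a minimal sketch (an illustration, not anything from Netflix's codebase): each client waits a randomized, exponentially growing delay before retrying, which spreads the reconnect attempts out instead of letting them stampede the service in synchronized waves.

```python
import random

# Sketch of "full jitter" exponential backoff: the delay ceiling doubles
# with each failed attempt, and the actual wait is drawn at random below
# that ceiling so clients stay desynchronized.
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-indexed)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The cap matters: without it, a long outage would push delays toward infinity; with it, every client eventually settles into retrying at a steady, randomized rate the service can absorb as it comes back.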

The release of Chaos Monkey into the wild to show everyone it works

One element of the architecture that Netflix credits with working extremely well through the storm, Chaos Monkey (part of the Simian Army), is now being released to the community as source code. Chaos Monkey adds an element of "fail often" to the underlying architecture that doesn't disrupt the overall efficiency of the service, but does help highlight potential trouble spots in backup systems before they're needed to deal with a big event like an AWS storm.

It's a bizarre idea: generate controllable failures during work hours, when engineers are on call to deal with them. But Netflix seems to regard it as the technological panacea for not seeing a failure coming until it's too late.

“Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support,” writes Netflix about the project. “In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems.

“With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.”
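The core logic Netflix describes, one victim per auto-scaling group, and only during hours when engineers are around, can be sketched in a few lines. This is a hypothetical illustration in Python (the real Chaos Monkey is a service released by Netflix, and the hour bounds here are assumptions, not Netflix's actual schedule):

```python
import random
from datetime import datetime

# Hypothetical sketch of Chaos Monkey's selection step: pick one instance
# per auto-scaling group to terminate, but only on weekdays during a
# limited window so engineers are alert and able to respond.
def pick_victims(groups: dict, now: datetime,
                 start_hour: int = 9, end_hour: int = 15) -> list:
    """Return one instance ID per group to terminate, or [] outside the window."""
    if now.weekday() >= 5 or not (start_hour <= now.hour < end_hour):
        return []  # outside business hours: do no harm
    return [random.choice(instances)
            for instances in groups.values() if instances]
```

In a real deployment the group listing and the terminations would go through the cloud provider's API; the point of the sketch is that the "chaos" is tightly bounded, one instance per group, inside a window when people can watch what happens.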

It’s a novel approach: Break yourself before something bigger breaks you.

Additional tools for preparing for disasters by providing testing metrics

The ability to create controllable failures and watch the system recover in AWS could be an excellent tool to combine with real-time log analysis such as Splunk offers; as we've seen, that's a powerful tool for protecting critical infrastructure, not just for cybersecurity but for general maintenance. Often there are warning signs that a portion of the infrastructure is about to fail: anomalies start to crop up, load shifts to places it shouldn't as a RAID array is about to give up the ghost, or a network card glitches out. But everything changes when it's the backup that's on the verge of failing.
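The simplest form of that warning-sign detection is a statistical outlier check on a metric's recent history. The sketch below is a toy illustration of the idea (any real log-analysis pipeline would be far more sophisticated): flag a sample that sits several standard deviations outside what the system has been doing lately.

```python
from statistics import mean, stdev

# Toy anomaly flag: is this metric sample far outside its recent history?
# A load spike toward a dying RAID array or a glitching NIC would show up
# here long before the component fails outright.
def is_anomaly(history: list, sample: float, sigmas: float = 3.0) -> bool:
    """True if `sample` deviates more than `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(sample - mu) > sigmas * sd
```

Feed a check like this with the metrics emitted while Chaos Monkey is running, and the controlled failures double as labeled test data for the monitoring itself.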

Backup doesn't run when things are going well; it runs when things are going badly. So even the backups and the fail-over servers need to be tested.

The paradigm that Netflix is developing on AWS with Chaos Monkey will deliver data on how healthy the backup systems are during business hours (so that failures can be dealt with), and, combined with strong data analysis, will keep its systems in good working order.

Fail now, and fail small, while you can still replace the components that might cause trouble in a larger failure, and you won't need to worry about them when lightning hits the transformer outside one of your major data centers.