Content streaming and movie rental service Netflix relies on cloud computing to power its core operations. In fact, the bulk of Netflix’s infrastructure is cloud-based, and it is one of Amazon Web Services’ (AWS) largest customers. Netflix has developed an entire arsenal of tools that help it run its massive cloud environment and handle outages and technical issues more efficiently.
Netflix refers to these tools as the Simian Army. The software includes colorfully named tools like Latency Monkey, Chaos Gorilla and Chaos Monkey. If you couldn’t tell by the name, Chaos Monkey is a scaled-down version of Chaos Gorilla. (Who says developers don’t have a sense of humor?)
Chaos Monkey is a service that runs on AWS and improves application resiliency by helping ensure an application can remain running if an instance unexpectedly shuts down – a universally helpful capability for any cloud-based application. Chaos Monkey works by randomly killing instances. If an application is well designed, the outage of a single node shouldn’t impact it. Developers can use the service to identify unnecessary dependencies and weed out architectural problems. Chaos Monkey was developed for AWS, but according to Netflix it is flexible enough to work with other cloud providers.
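To make the idea concrete, here is a minimal sketch in Python of the "randomly kill an instance, then check the application survives" loop. It is a simulation only, not Netflix's code and not the Chaos Monkey API: the `Instance` and `InstanceGroup` classes and the `unleash_monkey` function are hypothetical names invented for illustration, and no real cloud instances are touched.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud instance (not an AWS object)."""
    instance_id: str
    running: bool = True


@dataclass
class InstanceGroup:
    """Hypothetical group of interchangeable instances behind one service."""
    name: str
    instances: list = field(default_factory=list)

    def running_instances(self):
        return [i for i in self.instances if i.running]

    def is_available(self, min_running=1):
        # Assumption: the service counts as "up" while at least
        # min_running instances are still running.
        return len(self.running_instances()) >= min_running


def unleash_monkey(group, rng):
    """Stop one randomly chosen running instance.

    Returns the stopped instance, or None if nothing was running.
    """
    candidates = group.running_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


if __name__ == "__main__":
    group = InstanceGroup("web", [Instance(f"i-{n}") for n in range(3)])
    victim = unleash_monkey(group, random.Random())
    print(f"terminated {victim.instance_id}; "
          f"service available: {group.is_available()}")
```

With three redundant instances, losing one leaves the simulated service available. If the group held a single instance, the same call would take the service down, which is exactly the kind of weakness the tool is meant to surface.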
As promised in April, Netflix has made the code publicly available as open source. The company announced Chaos Monkey’s open source launch in an official blog post. According to the post, developers who use the service can be confident the tool has already been field tested. The announcement explained:
“Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don’t happen again.”
The code for Chaos Monkey is available on GitHub. Janitor Monkey, a tool similar to Cloudability that tracks down unused resources, might be the next candidate for an open source release.
Incidents like the recent Amazon outage and Azure’s Western European blackout show the importance of such solutions. In spite of Netflix’s preparation, the AWS failure still managed to take the service down. Netflix’s availability architecture did, however, limit the damage.