Netflix to Open Source Chaos Monkey and More


Netflix will open source Chaos Monkey and the rest of its Simian Army, the company’s Director of Cloud Architecture Adrian Cockcroft told Wired Enterprise. In fact, according to Cockcroft, Netflix will release “pretty much all” of its platform in bits and pieces through the summer and fall.

Chaos Monkey is a program that randomly kills instances and services throughout the Netflix architecture. The idea is to make all of Netflix’s components more robust – no one service should have any unnecessary dependencies. If the service that calculates a star rating for a user crashes, the user should still be able to stream videos on Netflix Streaming. I once logged into Netflix Streaming and found that although I couldn’t search for videos, I could still stream what was in my queue. That sort of resilience is part of how Netflix survived the Great Amazon Web Services Outage of 2011.
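The idea can be illustrated with a toy simulation. This is only a sketch and not Netflix's code: the `Service` class, the `chaos_monkey` function and the `render_title_page` function are all hypothetical names invented for the example, and a real Chaos Monkey terminates actual cloud instances rather than flipping a flag.

```python
import random


class Service:
    """A toy stand-in for a running service instance (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True


def chaos_monkey(services, rng, kill_probability=0.5):
    """Randomly mark services as dead and return the names that were killed."""
    killed = []
    for svc in services:
        if rng.random() < kill_probability:
            svc.alive = False
            killed.append(svc.name)
    return killed


def render_title_page(ratings, streaming):
    """Build a page that degrades gracefully when an optional dependency is down."""
    if not streaming.alive:
        # Streaming is the one dependency this page cannot work without.
        raise RuntimeError("streaming is a hard dependency")
    page = {"can_stream": True}
    # The star rating is optional: leave it out instead of failing the page.
    page["stars"] = 4.5 if ratings.alive else None
    return page


ratings = Service("ratings")
streaming = Service("streaming")
chaos_monkey([ratings], random.Random(), kill_probability=1.0)
print(render_title_page(ratings, streaming))
```

With the ratings service killed, the page still reports that streaming works and simply omits the star rating, which is the kind of behavior random failures are meant to force developers to build in.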

The rest of the Simian Army, in Netflix’s own words, includes:

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.

Conformity Monkey finds instances that don’t adhere to best practices and shuts them down. For example, we know that if we find instances that don’t belong to an auto-scaling group, that’s trouble waiting to happen. We shut them down to give the service owner the opportunity to re-launch them properly.

Doctor Monkey taps into health checks that run on each instance and monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and, after the service owners have had time to root-cause the problem, eventually terminated.

Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.

Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal.

10-18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration and runtime problems in instances serving customers in multiple geographic regions, using different languages and character sets.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
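The Latency Monkey approach described above, delaying a dependency and checking that its caller copes, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Netflix's implementation: `with_injected_latency`, `call_with_timeout` and `fetch_recommendations` are hypothetical names, and the delay is injected in-process rather than in a RESTful communication layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_injected_latency(func, delay_seconds):
    """Wrap func so that every call sleeps for delay_seconds before running."""

    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)

    return delayed


def call_with_timeout(func, timeout_seconds, fallback):
    """Run func in a worker thread; return fallback if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(func)
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the slow call; the worker finishes in the background.
        executor.shutdown(wait=False)


def fetch_recommendations():
    """Hypothetical dependency that normally answers immediately."""
    return ["Title A", "Title B"]


# A delay much larger than the caller's timeout simulates an outage
# without actually taking the dependency down.
slow = with_injected_latency(fetch_recommendations, 0.5)
print(call_with_timeout(slow, 0.05, fallback=[]))
```

A caller that falls back to an empty list here has passed the test; one that hangs or crashes has a dependency problem worth fixing before a real outage exposes it.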

Why open source all these tools? Couldn’t some competing video streaming service take these tools and use them to compete with Netflix? Cockcroft told Wired that open sourcing their work helps keep the Netflix team in line with what other cloud providers are doing and prevents Netflix from becoming an outlier system, which makes hiring easier. Hiring also benefits because developers like to work on open source.

For all of Netflix’s love of open source, Linux desktop users are still waiting for a way to stream Netflix, despite the fact that a solution already exists for streaming Netflix on Chrome OS. Hmmm.