Adrian Cockcroft Architecture Director at Netflix in #theCube Talking Chaos Monkey and Cassandra


John Furrier sat down in theCube with Adrian Cockcroft, Director of Architecture for the Cloud Systems team at Netflix. They spoke about how Netflix handled the recent power outages and the AWS cloud failure, and how its system has grown from its inception into the behemoth of public cloud power it is today. It’s been a long, strange road, but the video-streaming network has survived some major outages and come through standing tall.

According to Adrian Cockcroft, Netflix has spent the last year getting everything running on Cassandra and Amazon (AWS). He says that since moving to the cloud, Netflix has become much more reliable and present, and the streaming service now runs entirely on the cloud.

“We had 4 goals when we did the cloud migration for Netflix: low latency, scalability, reliability–your TV set should just work–and be very productive for developers. We wanted to get everything out of the way for the developers.” Cockcroft described how, while preparing the demo, he could click a button, build out the set of machines, run the build, and turn it all off again in less than an hour.

“We call the data center guys ‘server huggers’ sometimes,” says Cockcroft. The term is about not wanting to let go of the machines in the data center, which is the opposite of the cloud mindset. Here, by contrast, he brings a set of virtual machines to life in the cloud at the click of a button. This sort of instant availability is good for developers and operations alike, because they can requisition compute capacity in almost no time at all.

He says that he likes Amazon because it is very reliable. One time an instance that they launched failed to work properly, so Amazon simply killed it off and started a new one. That probably delayed the launch by only about 5 minutes in total.

Cassandra is also very good for scalability because you can just keep adding extra storage space and compute to the cluster. With a normal cluster it’s possible to choke the network and have it die off from lack of “oxygen” while it’s trying to grow. Cassandra organizes the expansion so that nothing gets starved out and performance doesn’t suffer while it’s scaling out.

When Furrier asked how much the demo cost, Cockcroft explained that it probably cost about $150 at most ($100 for one cluster and $50 for another), and possibly just a fraction of that because he was only using it for a short time. He didn’t need it before it came to life, and he let it dissolve back into the cloud after he was done. This shows how rapidly instantiated cloud systems can cut costs a great deal: you requisition computation and storage out of the ether, release them when you are done, and pay only for the time you use.
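The pay-for-what-you-use arithmetic can be sketched in a few lines of Python. This is only an illustration: the $100 and $50 figures come from the interview, but the 24-hour billing period, the one hour of use, and the prorated_cost helper are assumptions made up for the example and do not reflect actual AWS pricing.

```python
def prorated_cost(full_period_cost: float, period_hours: float, hours_used: float) -> float:
    """Cost of a resource billed pro rata for the hours it actually ran.

    Hypothetical helper: real cloud billing has its own granularity and rates.
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if hours_used < 0:
        raise ValueError("hours_used must not be negative")
    # Never charge more than the full-period price.
    billable_hours = min(hours_used, period_hours)
    return full_period_cost * billable_hours / period_hours


# Full-period costs quoted in the interview; the cluster names are placeholders.
clusters = {"cluster_a": 100.0, "cluster_b": 50.0}

# Assumption: the quoted costs cover 24 hours and the demo ran for 1 hour.
total = sum(prorated_cost(cost, period_hours=24, hours_used=1)
            for cost in clusters.values())
print(f"Estimated demo cost: ${total:.2f}")
```

Under these assumed numbers the bill is a small fraction of the $150 ceiling, which is the point Cockcroft was making.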

Netflix and how to prepare for reliability and data protection

Furrier said that Netflix is in a place where they can play with a lot of things and do a lot of R&D, but they also face DevOps challenges such as high availability and data protection.

Cockcroft talked about two power failures and how the backend systems were affected (machines had to be replaced), but Cassandra kept the entire network performing even when machines fried. The second outage took a lot more out of commission because of a single-line code bug that had not yet been discovered. Netflix now has a specialized test system to find bugs of that kind, which should prevent similar failures in the future.

Netflix implemented tools called Chaos Monkey and Chaos Gorilla, which are capable of randomly knocking out entire clusters or zones while the system is running, with Cassandra shifting load to make sure performance doesn’t suffer. The useful takeaway of this testing protocol is that the “unpredictable” failures happen while engineers are watching, so if something goes wrong it can be fixed right away. The tests also highlight potential failure points so they can be repaired before they become major issues.
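The idea behind this kind of testing can be illustrated with a toy simulation in Python. This is not Netflix’s Chaos Monkey code and it does not talk to Cassandra or AWS; the Cluster class, the chaos_round function, the node names, and the simple ring-style replica placement are all assumptions invented for the sketch. It shows only the principle: kill random nodes on purpose, then check that every piece of data can still be served from a surviving replica.

```python
import random


class Cluster:
    """Toy cluster: each key is stored on several consecutive nodes of a ring."""

    def __init__(self, node_names, replication_factor=3):
        self.order = list(node_names)
        self.alive = {name: True for name in self.order}
        self.replication_factor = replication_factor

    def replicas_for(self, key):
        # Deterministic starting position derived from the key's bytes.
        start = sum(key.encode()) % len(self.order)
        return [self.order[(start + i) % len(self.order)]
                for i in range(self.replication_factor)]

    def can_serve(self, key):
        # A key is servable while at least one of its replicas is alive.
        return any(self.alive[node] for node in self.replicas_for(key))

    def kill(self, name):
        self.alive[name] = False


def chaos_round(cluster, rng):
    """Kill one randomly chosen live node and return its name."""
    candidates = [name for name, up in cluster.alive.items() if up]
    victim = rng.choice(candidates)
    cluster.kill(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random()
    cluster = Cluster([f"node{i}" for i in range(6)], replication_factor=3)
    for _ in range(2):
        print("killed", chaos_round(cluster, rng))
    keys = [f"title-{i}" for i in range(100)]
    unavailable = [k for k in keys if not cluster.can_serve(k)]
    print("unavailable keys:", unavailable)
```

With three replicas per key, losing any two nodes leaves every key with at least one live copy, so the list of unavailable keys stays empty. Raising the number of rounds to three or more can start to expose keys whose replicas are all gone, which is the kind of weak point the real exercise is meant to surface while engineers are watching.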