Cloud Failure Is Inevitable: Learn From the Crash and Move On

Every time there’s a major cloud outage, we end up having the same conversation about the reliability and viability of the public cloud as an enterprise IT platform. And now that Amazon Web Services has released a detailed post-mortem on its outage over the weekend (short version: an electrical storm in Northern Virginia caused a service disruption), it’s back to the same old cloud debate.

The thing with the cloud is that it’s easy to forget it runs in data centers much like any other. Data centers fail every single day, in any number of ways, and you never hear about it. A cloud data center makes headlines only because it might be hosting any number of web-scale production services. But at the same time, it’s that very consolidation that makes an outage a big deal.

GigaOM has a thoughtful analysis of what cloud reliability means in the modern era, with insight from Geoff Arnold, billed as an “industry consultant and entrepreneur-in-residence at U.S. Venture Partners.” Arnold makes some very good points, including that fault tolerance at Amazon Web Services isn’t as good as it could be simply because the cloud services giant is trying to keep prices low. If it charged customers more, it would have more money to throw at the problem; but it won’t, so it can’t.

But that’s a problem that’s largely specific to Amazon Web Services (obviously other cloud service providers are under price pressures as well, but AWS is committed to being the go-to cheap compute provider in the cloud).

The thing is, though, that it’s also a question of complexity, a problem increasingly faced by IT pros all over the world. Software engineers are increasingly required to strap together hardware and software that were never designed to work together, across public and private infrastructures, increasing the number of failure points – and thus the likelihood of failure itself.

An obvious remedy for the complexity problem is automation. But in a conversation with Christopher Brown, CTO of Opscode, developer of the popular Chef automation tool, it became clear that there’s no one-size-fits-all solution for addressing these problems across heterogeneous infrastructures.

The market as a whole is really good at telling an arbitrary group of servers how to behave, Brown says, but not so good at getting those disparate groups to talk to each other. Something that’s going to be discussed a lot over the next several months is orchestration across these increasingly complex and geographically dispersed infrastructures. Opscode (and presumably Puppet Labs and the other automation vendors) is working on smoothing this road ahead.
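The core idea behind tools like Chef is declarative convergence: you describe the state a server should be in, and the tool applies only the changes needed to get there, idempotently. Chef itself uses a Ruby DSL, but the idea can be sketched in a few lines of Python (the resource names below are made up for illustration; this is not Chef's actual API):

```python
# Sketch of declarative convergence, the idea behind tools like Chef:
# compare desired state to current state and act only on the drift.
# Resource names are hypothetical, not Chef's API.

def converge(desired, current):
    """Return the actions needed to move `current` toward `desired`.
    Both are dicts mapping a resource name to its state string."""
    actions = []
    for resource, state in desired.items():
        if current.get(resource) != state:
            actions.append((resource, state))  # only drifted resources
    return actions

desired = {"nginx": "running", "ntp": "running", "telnet": "absent"}
current = {"nginx": "stopped", "ntp": "running"}

print(converge(desired, current))  # nginx needs starting, telnet removing
```

Because only the drift is acted on, running the same recipe twice is harmless: once the node matches the desired state, a second run produces no actions.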

But in the meantime, there’s nothing to do but keep in mind the lessons we learned the last time Amazon Web Services went down: for some workloads the private cloud is better, and for some the public cloud is better. It’s increasingly important to learn from failures, plan your application infrastructure around them, deploy across multiple clouds, and above all keep your data spread across as many physical locations as possible.
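That last point, spreading data across physical locations, comes down to replica placement: no two copies of your data should share the same regional failure domain, so one storm can't take them all out. A minimal sketch of that rule, assuming made-up region and zone names:

```python
# Sketch of failure-domain-aware replica placement: pick zones so that
# no two replicas land in the same region. Region/zone names are
# illustrative, not tied to any particular provider's API.

def place_replicas(zones, copies):
    """Pick `copies` zones from `zones` (a list of (region, zone)
    pairs), each in a distinct region. Raises ValueError if there
    aren't enough distinct regions to isolate every copy."""
    placement, regions_used = [], set()
    for region, zone in zones:
        if region not in regions_used:
            regions_used.add(region)
            placement.append((region, zone))
            if len(placement) == copies:
                return placement
    raise ValueError("fewer distinct regions than requested copies")

zones = [("us-east-1", "a"), ("us-east-1", "b"),
         ("us-west-2", "a"), ("eu-west-1", "a")]
print(place_replicas(zones, 3))  # three copies, three regions
```

Note that two zones in the same region (us-east-1a and us-east-1b) never both get a copy; an outage like the Northern Virginia storm would cost at most one replica.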