UPDATED 12:33 EDT / APRIL 29 2011

Everything You Ever Wanted to Know About the Amazon EC2 Crash

Last week was the terrible, horrible, no good, very bad week for the cloud, and specifically for Amazon’s cloud, when its service crashed and took down a multitude of sites that depended on it. It’s taken a week, but Amazon has finally published a detailed account of exactly what went down and why; the explanation is not for the weak of technical heart. The cloud-mogul company has released an apology to its customers alongside the lengthy technical explanation.

All Things D picked up the apology and boiled it down to the point where people with less technical savvy could understand it:

It all started at 12:47 am PT on April 21 in Amazon’s Elastic Block Storage operation, which is essentially the storage used by Amazon’s EC2 cloud compute service, so EBS and EC2 go hand in hand. During normal scaling activities, a network change was underway. It was performed incorrectly. Not by a machine, but by a human. As Amazon puts it:

“The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”

So it looks like chances are it was a cascade effect, multiplied by system complexity and set in motion by human error. Think of it like an avalanche triggered by a single misstep by a hiker on a rocky ridge: a small number of pebbles starts moving, more get caught up, and then you’ve got an unstoppable collapse underway.
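That avalanche dynamic can be sketched in a few lines of code. This is a toy model only (the node counts, loads, and capacities below are made up, and it bears no relation to Amazon’s actual architecture): when a node fails, its load is shed onto the survivors, and any survivor pushed past capacity fails in turn.

```python
# Toy cascade model -- illustration only, not Amazon's actual system.
# When nodes fail, their load is spread evenly over the survivors;
# any survivor pushed past its capacity fails too, and so on.

def cascade(loads, capacity, first_failure):
    """Return the set of nodes that have failed once the cascade settles."""
    failed = {first_failure}
    while True:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            return failed  # total collapse
        shed = sum(loads[i] for i in failed) / len(survivors)
        newly_failed = {i for i in survivors if loads[i] + shed > capacity}
        if not newly_failed:
            return failed  # the cascade has stopped
        failed |= newly_failed

# One heavily loaded node trips; the redistributed traffic takes
# down the rest, a handful of nodes at a time.
failed = cascade([95, 92, 88, 80, 70, 60, 50, 40, 30, 20],
                 capacity=100, first_failure=0)
print(len(failed))  # all ten nodes end up down
```

Swap in a lighter load (say, 60 units per node) and the failure stays contained to the single node that tripped, which is exactly the difference a capacity margin buys.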

When the crash happened, it took down a multitude of websites that rely heavily on the EBS and EC2 compute and storage services, such as Foursquare and Reddit. In fact, Reddit didn’t manage to get itself back in motion until almost 12 hours after Amazon reportedly restored more than 90% of its service, due to sheer latency through the system. And now that the dust has settled, we’re still waiting on numbers from Amazon and others about how much data was lost in the debacle; even if it’s something like one-tenth of a percent, that’s still a notable amount of data. Business Insider asked just this question and heard from ChartBeat, which lost over 11 hours of historical data in the cloud catastrophe.

The crash has highlighted an issue long raised by critics of cloud computing: in its failure modes, the cloud can mimic a monolithic system. It reminds me of RAID setups, which come in two basic flavors: striping (RAID 0), which allows for much faster data reads and writes, and mirroring (RAID 1), which duplicates data so that if a drive dies, the data can be recovered. Right now, Amazon’s cloud seems to give us a lot of speed, but when something goes wrong it affects almost the entire setup and sometimes causes data loss.
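For the curious, here’s a toy illustration of that RAID trade-off (a sketch in plain Python, not a real RAID implementation): striping splits data across drives for speed with no safety net, while mirroring keeps a full copy on every drive.

```python
# Toy sketch of the two RAID flavors -- not a real RAID implementation.

def stripe(data: bytes, drives: int) -> list[bytes]:
    """RAID 0-style striping: deal bytes round-robin across drives."""
    return [data[i::drives] for i in range(drives)]

def mirror(data: bytes, drives: int) -> list[bytes]:
    """RAID 1-style mirroring: every drive holds a full copy."""
    return [data] * drives

data = b"cloud outage"
striped = stripe(data, 2)
mirrored = mirror(data, 2)

# Simulate losing drive 0 in each setup.
striped[0] = None
mirrored[0] = None

# The mirrored setup still has every byte on the surviving drive...
assert mirrored[1] == data
# ...but the striped setup has permanently lost every other byte.
assert striped[1] != data and len(striped[1]) == len(data) // 2
```

Striping is fast because every drive reads and writes in parallel, but as the assertions show, a single drive failure is fatal; mirroring survives it at the cost of halving usable capacity.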

Perhaps it’s just the year for data disasters. Recall the cloud crash that hit Google in early March, after which the company had to restore most of the lost data from tape backups in order to get the system going again. Extra redundancy does create a great deal of extra overhead for complex systems, and it only comes into play when something goes catastrophically wrong. So it’s often a risk vs. reward decision on the part of a company, a sort of gamble over how to balance redundancy against profitability.
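That gamble can be framed as a back-of-the-envelope expected-cost calculation. Every figure below is hypothetical, chosen only to show the shape of the trade-off, not anyone’s real economics:

```python
# Back-of-the-envelope risk vs. reward -- all numbers are hypothetical.

def expected_annual_cost(run_cost, redundancy_overhead,
                         failure_probability, failure_cost):
    """Running cost plus the probability-weighted cost of a disaster."""
    return run_cost + redundancy_overhead + failure_probability * failure_cost

# No extra redundancy: cheap to operate, catastrophic when it fails.
lean = expected_annual_cost(1_000_000, 0, 0.05, 20_000_000)

# With redundancy: 40% more overhead, but a failure costs far less.
safe = expected_annual_cost(1_000_000, 400_000, 0.05, 1_000_000)

print(lean, safe)  # under these made-up numbers, redundancy wins
```

Drop the assumed failure probability low enough and the lean setup wins instead, which is precisely the bet each company has to place.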

This is the crux of the current rumblings about the usefulness of the cloud, but more likely these incidents highlight the growing pains of an emerging technology. Cloud computing and storage will likely behave like any other industry: some companies will brand themselves as more stable than others. The cloud provides high-bandwidth computing and storage for a fraction of the cost of maintaining your own server farm, but it depends on its own infrastructure, and right now there are only a few very big players in the market. Those who can show they have the best crash and disaster recovery protocols will probably draw in and hold customers.

At least Amazon’s coming out and saying it was in fact human error finally dismisses the speculation that Terminator’s Skynet was involved. (Way to go, Amazon, wink-wink, nudge-nudge.)

