UPDATED 12:33 EDT / APRIL 29 2011

Everything You Ever Wanted to Know About the Amazon EC2 Crash

Last week was the terrible, horrible, no good, very bad week for the cloud, and specifically for Amazon’s cloud, when its service crashed and took down a multitude of sites that depended on it. It’s taken a week, but Amazon has finally produced a vast enumeration of exactly what went down and why, and the explanation is not for the weak of technical heart. The cloud-mogul company has released an apology to its customers alongside the lengthy technical explanation.

All Things D picked up the apology and boiled it down to the point where people with less technical savvy can understand it:

It all started at 12:47 am PT on April 21 in Amazon’s Elastic Block Store (EBS) operation, which is essentially the storage used by Amazon’s EC2 cloud compute service, so EBS and EC2 go hand in hand. During normal scaling activities, a network change was underway, and it was performed incorrectly: not by a machine, but by a human. As Amazon puts it:

“The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”

So it looks like it was a cascade effect, multiplied by system complexity and set in motion by human error. In Amazon’s account, once the traffic landed on the lower-capacity network, EBS nodes lost contact with their replicas and began re-mirroring all at once, swamping what capacity remained. Think of it a lot like an avalanche triggered by a single misstep by a hiker on a rocky ridge: a small number of pebbles starts moving, then more get caught up, and then you’ve got an unstoppable collapse underway.
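
To make that concrete, here is a minimal, purely illustrative Python sketch of the dynamic. The capacities, node counts, and doubling behavior are all invented for the example; none of it reflects Amazon’s actual numbers or architecture.

```python
# Illustrative toy model: what happens when primary EBS replication
# traffic is shifted onto a lower-capacity secondary network. All
# numbers are invented; this is not Amazon's architecture.

PRIMARY_CAPACITY = 100.0    # arbitrary bandwidth units on the primary network
SECONDARY_CAPACITY = 10.0   # the lower-capacity redundant network
NODES = 50                  # storage nodes whose traffic was shifted
TRAFFIC_PER_NODE = 1.5      # steady-state replication traffic per node


def simulate(capacity, nodes, traffic_per_node, steps=5):
    """Each step, if total demand exceeds capacity, nodes lose contact
    with their replicas and start re-mirroring, which (in this toy
    model) doubles their demand the next step: the cascade."""
    demand = [traffic_per_node] * nodes
    for step in range(steps):
        total = sum(demand)
        overloaded = total > capacity
        status = "OVERLOADED" if overloaded else "ok"
        print(f"step {step}: demand={total:.1f} capacity={capacity} {status}")
        if not overloaded:
            return
        demand = [d * 2 for d in demand]  # the avalanche picks up more pebbles


simulate(PRIMARY_CAPACITY, NODES, TRAFFIC_PER_NODE)    # fits comfortably
simulate(SECONDARY_CAPACITY, NODES, TRAFFIC_PER_NODE)  # overloaded immediately, snowballs
```

Run it and the primary network absorbs the load without complaint, while the secondary network is overloaded on the very first step and only gets worse from there.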

When the crash happened, it took down a multitude of websites that rely heavily on the EC2 and EBS storage services, such as Foursquare and Reddit. In fact, Reddit didn’t manage to get itself back in motion until almost 12 hours after Amazon had reportedly restored more than 90% of the service, thanks to sheer latency working through the system. And now that the dust has settled, we’re still waiting on numbers from Amazon and others about how much data was lost in the debacle; even if it’s something like one-tenth of a percent, that’s still a notable amount of data (a tenth of a percent of a petabyte is a full terabyte). Business Insider asked just this question and heard from ChartBeat, which lost over 11 hours of historical data in the cloud catastrophe.

The crash has highlighted a lot of the issues raised by critics of cloud computing, chiefly that the cloud can mimic a monolithic system in its failure modes. It reminds me of RAID setups, which run in two basic flavors: striping, which allows for much faster data reads and writes; and mirroring, which duplicates data so that if a drive explodes, the data can be recovered. Right now, Amazon’s cloud seems to give us a lot of speed, but when something goes wrong it affects almost the entire setup and sometimes causes data loss.
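
For anyone who wants the RAID analogy spelled out, here is a small, hypothetical Python sketch of the two flavors. Real RAID controllers work on blocks and parity, and nothing below reflects how Amazon actually stores data; it just shows the trade-off.

```python
# Conceptual illustration of the two RAID flavors mentioned above.
# "Drives" are plain Python lists; real RAID works on blocks and
# parity, but the trade-off is the same.

def striped_write(drives, data):
    """RAID-0 style: split the data across drives. Fast, because each
    drive stores only part of it, but losing any one drive loses data."""
    for i, byte in enumerate(data):
        drives[i % len(drives)].append(byte)

def mirrored_write(drives, data):
    """RAID-1 style: every drive gets a full copy. More overhead,
    but any single drive failure is recoverable."""
    for drive in drives:
        drive.extend(data)

data = b"important customer records"

striped = [[], []]
striped_write(striped, data)
striped[0] = None                  # one drive "explodes"
# The surviving drive holds only half the bytes; the rest is simply gone.

mirrored = [[], []]
mirrored_write(mirrored, data)
mirrored[0] = None                 # same failure
assert bytes(mirrored[1]) == data  # the mirror still has everything
```

The point of the analogy: last week’s outage felt like the striped case, where speed is plentiful but a single failure reaches much further than anyone would like.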

Perhaps it’s just the year for data disasters; we might all recall that when a cloud crash happened to Google in early March, the company had to pull most of the affected data off tape drives in order to restart the system. Extra redundancy does create a great deal of extra overhead for complex systems, and it only comes into play when something goes catastrophically wrong. So it’s often a risk-vs.-reward decision on the part of a company, a sort of gamble over how to weigh redundancy against profitability.

This is the crux of the current rumblings about the usefulness of the cloud, but more likely these incidents highlight the growing pains of an emerging technology. Cloud computing and storage will likely behave like any other industry: some companies will brand themselves as more stable than others. The cloud does provide high-bandwidth computing and storage for a fraction of the cost of maintaining your own server farm, but it depends on its own infrastructure, and right now there are only a few very big players in the market. Those who can show that they have the best crash and disaster recovery protocols will probably draw in and hold customers.

At least having Amazon come out and say that it was in fact human error finally dismisses the speculation that Terminator’s Skynet was involved. (Way to go, Amazon. Wink-wink, nudge-nudge.)

