UPDATED 12:33 EDT / APRIL 29 2011

Everything You Ever Wanted to Know About the Amazon EC2 Crash

Last week was the terrible, horrible, no good, very bad week for the cloud, and specifically for Amazon’s cloud, when its service crashed and took down a multitude of sites that depended on it. It’s taken a week, but Amazon has finally published a detailed account of exactly what went down and why; the explanation is not for the weak of technical heart. The cloud-mogul company has released an apology to its customers alongside the lengthy technical explanation.

All Things D picked up the apology and boiled it down to the point where people with less technical savvy could understand it:

It all started at 12:47 am PT on April 21 in Amazon’s Elastic Block Storage operation, which is essentially the storage used by Amazon’s EC2 cloud compute service, so EBS and EC2 go hand in hand. During normal scaling activities, a network change was underway. It was performed incorrectly. Not by a machine, but by a human. As Amazon puts it:

“The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.”

So it looks like chances are it was a cascade effect, multiplied by system complexity and set in motion by human error. Think of it like an avalanche triggered by a single misstep by a hiker on a rocky ridge: a small number of pebbles starts moving, more get caught up, and then you’ve got an unstoppable collapse underway.
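That avalanche dynamic can be sketched in a few lines of code. This is a toy model only (the node counts, loads, and capacities below are made up, and it bears no relation to Amazon’s actual architecture): when a node fails, its load is shed onto the survivors, and any survivor pushed past capacity fails in turn.

```python
# Toy cascade model -- illustration only, not Amazon's actual system.
# When nodes fail, their load is spread evenly over the survivors;
# any survivor pushed past its capacity fails too, and so on.

def cascade(loads, capacity, first_failure):
    """Return the set of nodes that have failed once the cascade settles."""
    failed = {first_failure}
    while True:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            return failed  # total collapse
        shed = sum(loads[i] for i in failed) / len(survivors)
        newly_failed = {i for i in survivors if loads[i] + shed > capacity}
        if not newly_failed:
            return failed  # the cascade has stopped
        failed |= newly_failed

# One heavily loaded node trips; the redistributed traffic takes
# down the rest, a handful of nodes at a time.
failed = cascade([95, 92, 88, 80, 70, 60, 50, 40, 30, 20],
                 capacity=100, first_failure=0)
print(len(failed))  # all ten nodes end up down
```

Swap in a lighter load (say, 60 units per node) and the failure stays contained to the single node that tripped, which is exactly the difference a capacity margin buys.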

When the crash happened, it took down a multitude of websites that rely heavily on the EBS and EC2 compute and storage services, such as Foursquare and Reddit. In fact, Reddit didn’t manage to get itself back in motion until almost 12 hours after Amazon reportedly restored more than 90% of its service, due to sheer latency through the system. And now that the dust has settled, we’re still waiting on numbers from Amazon and others about how much data was lost in the debacle; even if it’s something like one-tenth of a percent, that’s still a notable amount of data. Business Insider asked just this question and heard from ChartBeat, which lost over 11 hours of historical data in the cloud catastrophe.

The crash has highlighted an issue long raised by critics of cloud computing: in its failure modes, the cloud can mimic a monolithic system. It reminds me of RAID setups, which come in two basic flavors: striping (RAID 0), which allows for much faster data reads and writes, and mirroring (RAID 1), which duplicates data so that if a drive dies, the data can be recovered. Right now, Amazon’s cloud seems to give us a lot of speed, but when something goes wrong it affects almost the entire setup and sometimes causes data loss.
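For the curious, here’s a toy illustration of that RAID trade-off (a sketch in plain Python, not a real RAID implementation): striping splits data across drives for speed with no safety net, while mirroring keeps a full copy on every drive.

```python
# Toy sketch of the two RAID flavors -- not a real RAID implementation.

def stripe(data: bytes, drives: int) -> list[bytes]:
    """RAID 0-style striping: deal bytes round-robin across drives."""
    return [data[i::drives] for i in range(drives)]

def mirror(data: bytes, drives: int) -> list[bytes]:
    """RAID 1-style mirroring: every drive holds a full copy."""
    return [data] * drives

data = b"cloud outage"
striped = stripe(data, 2)
mirrored = mirror(data, 2)

# Simulate losing drive 0 in each setup.
striped[0] = None
mirrored[0] = None

# The mirrored setup still has every byte on the surviving drive...
assert mirrored[1] == data
# ...but the striped setup has permanently lost every other byte.
assert striped[1] != data and len(striped[1]) == len(data) // 2
```

Striping is fast because every drive reads and writes in parallel, but as the assertions show, a single drive failure is fatal; mirroring survives it at the cost of halving usable capacity.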

Perhaps it’s just the year for data disasters. Recall the cloud crash that hit Google in early March, after which the company had to restore most of the lost data from tape backups in order to get the system going again. Extra redundancy does create a great deal of extra overhead for complex systems, and it only comes into play when something goes catastrophically wrong. So it’s often a risk vs. reward decision on the part of a company, a sort of gamble over how to balance redundancy against profitability.
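That gamble can be framed as a back-of-the-envelope expected-cost calculation. Every figure below is hypothetical, chosen only to show the shape of the trade-off, not anyone’s real economics:

```python
# Back-of-the-envelope risk vs. reward -- all numbers are hypothetical.

def expected_annual_cost(run_cost, redundancy_overhead,
                         failure_probability, failure_cost):
    """Running cost plus the probability-weighted cost of a disaster."""
    return run_cost + redundancy_overhead + failure_probability * failure_cost

# No extra redundancy: cheap to operate, catastrophic when it fails.
lean = expected_annual_cost(1_000_000, 0, 0.05, 20_000_000)

# With redundancy: 40% more overhead, but a failure costs far less.
safe = expected_annual_cost(1_000_000, 400_000, 0.05, 1_000_000)

print(lean, safe)  # under these made-up numbers, redundancy wins
```

Drop the assumed failure probability low enough and the lean setup wins instead, which is precisely the bet each company has to place.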

This is the crux of the current rumblings about the usefulness of the cloud, but more likely these incidents highlight the growing pains of an emerging technology. Cloud computing and storage will likely behave like any other industry: some companies will brand themselves as more stable than others. The cloud provides high-bandwidth computing and storage for a fraction of the cost of maintaining your own server farm, but it depends on its own infrastructure, and right now there are only a few very big players in the market. Those who can show they have the best crash and disaster recovery protocols will probably draw in and hold customers.

At least Amazon’s coming out and saying it was in fact human error finally dismisses the speculation that Terminator’s Skynet was involved. (Way to go, Amazon, wink-wink, nudge-nudge.)

