AWS explains last week’s cloud outage and promises to improve its status page
Amazon Web Services Inc. said this weekend that it will make changes to its cloud Service Health Dashboard in the wake of a major outage last week that took multiple connected services, including financial apps and food delivery platforms, offline for several hours.
In a report published last Friday on the impact of the event, Amazon said the problems began in its US-East-1 data center region in Virginia at 10:30 a.m. EST on Tuesday, Dec. 7.
Amazon blamed an “automated activity” that was meant to scale capacity for one of its services hosted in the main AWS network. That activity apparently triggered “unexpected behavior” from a large number of clients within the internal network. As a result, multiple devices connecting an internal Amazon network with an AWS network became overloaded.
The incident hurt AWS cloud services such as AWS EC2, which provides virtual server capacity for multiple enterprises. Many services were taken offline for several hours, resulting in widespread disruption for Amazon’s customers. Reports said popular streaming services such as Netflix and Disney+ went down, while connected devices such as Amazon.com Inc.’s Ring security cameras and iRobot Corp.’s Roomba vacuums also stopped working.
Amazon suffered too, because many of its warehouse and delivery employees use applications powered by AWS to do their jobs. Reports said Amazon workers were unable to scan packages or see their delivery routes for much of Tuesday as they waited for AWS engineers to restore service.
Some AWS services came back online within a few hours, but others, such as the developer tool AWS EventBridge, didn’t return fully until 9:40 p.m. EST.
AWS is generally a very reliable service. The last major incident affecting AWS occurred in 2017, when an employee accidentally turned off more servers than intended during repairs of a billing system. But Tuesday’s outage was a big blow to AWS’ reputation, undermining claims that cloud infrastructure is reliable and enterprise-ready. AWS apologized to its customers for the disruption.
AWS also admitted it struggled to keep customers aware of what was happening during the incident. It had problems updating its Service Health Dashboard, which is the primary status page for AWS customers. Many customers also complained they were unable to create support tickets during the disruption.
“As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue,” AWS said.
Constellation Research Inc. analyst Holger Mueller told SiliconANGLE that cloud outages are always unfortunate and they’re also highly visible with so many services dependent on the public cloud today. He said AWS has provided a bit more insight into what happened than is usually the case, as the outage took down two networks, the regular AWS network and also an internal one that’s used for a number of its own services.
“That a regular networking scaling request took the junction to the internal network down is a surprise, because the scaling code has likely been in production for a number of years already,” Mueller said.
The analyst said the disruption was interesting because AWS has always maintained that its own activity is cordoned off from that of its commercial AWS customers, yet both were affected. He also said Amazon’s internal and customer-facing tools clearly need to be made more reliable.
“It’s time AWS makes these support systems and its dashboard more solid, and least dependent on the ‘Uber Region’ US-EAST,” Mueller said. “In any case AWS deserves kudos for its transparency in the postmortem, let’s just hope it learns from this and that its systems will be more resilient going forward.”
AWS has promised to take action, with a new version of the Service Health Dashboard arriving in early 2022 that will make it easier to understand service impact. It’s also planning to launch a new support system architecture that spans multiple AWS regions to ensure there will be no delays in communicating with customers.