Amazon has issued an official apology to users affected by last week’s massive AWS outage. Chances are you’re one of them.
On December 24, a member of the crew in charge of maintaining the Amazon Elastic Load Balancing Service accidentally deleted some data as part of a routine maintenance process. This triggered a stream of API errors and performance issues that crippled the service until the next day, and took down many major North American AWS customers with it. Among them were Heroku and Netflix, which couldn’t deliver content to users from Canada to South America.
Amazon provided a very detailed report of the incident. The company says that its engineers spent several hours trying to restore a snapshot of the data from right before the incident; they had to backtrack at least once, but they got it done by 2:45 AM PST on Christmas Day. With the data in hand, they embarked on the task of syncing it with recent settings changes.
By 8:15 AM the majority of the ELB APIs and workflows were up and running. Major progress was made over the next two hours, and by the time 12:00 rolled around, the service was fully operational.
Amazon added that it has taken measures to make sure a similar incident will not repeat itself in the future.
“We have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval,” says an Amazon representative. “Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data.”
Amazon also improved its data recovery policy based on what the developers learned during the restoration process, the statement adds.