UPDATED 07:00 EDT / DECEMBER 19 2014

Microsoft admits “human error” led to last month’s Azure outage

Microsoft has just published a blow-by-blow account of what went wrong during last month’s Azure cloud storage service outage, which caused thousands of websites, including its own Windows Store and MSN.com, to go offline.

The Microsoft Azure service interruption on Nov. 18 resulted in intermittent connectivity issues with the Azure Storage service in multiple regions, and it was all down to simple human error.

Jason Zander, Microsoft’s Vice President for Azure, admitted as much in a blog post, saying “there was a gap in the deployment tooling that relied on human decisions and protocol”.

At the time of the outage, Microsoft had said the problem was caused due to “… an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting (testing).”

Zander explains further:

“There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’”

“When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”
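The staged rollout Zander describes can be sketched roughly as follows. This is a minimal illustration of the flighting idea, not Microsoft's actual tooling; all names (`flight`, `deploy_to`, `healthy`, the slice names) are assumptions made up for this example.

```python
# Hypothetical sketch of a "flighting" rollout: a change is deployed to
# small slices of infrastructure one at a time, and each step is gated
# on a health check before the rollout continues.

def healthy(slice_name):
    # Stand-in health check; a real system would query monitoring data
    # gathered while the flight is in progress.
    return True

def deploy_to(slice_name, change):
    print(f"deploying {change!r} to {slice_name}")

def flight(change, slices):
    """Deploy `change` slice by slice, halting at the first unhealthy slice."""
    for s in slices:
        deploy_to(s, change)
        if not healthy(s):
            print(f"health check failed on {s}; halting rollout")
            return False
    return True

flight("table-frontend-cpu-fix", ["test-ring", "region-1", "region-2"])
```

The point of the pattern is that a bad change only ever reaches one small slice before the health gate stops it, which is exactly the safeguard that was bypassed in the incident described below.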

Now, Microsoft says its analysis of the outage shows that “faulty flighting” was what caused the problem. It began with a software change Microsoft’s engineers made to improve Azure Storage’s performance by reducing the CPU footprint of the Azure Storage Table Front-Ends.

Unfortunately for Microsoft, someone forgot to follow protocol. “The standard flighting deployment policy of incrementally deploying changes across small slices was not followed,” said Zander. The engineer performing the upgrade apparently thought that doing so was low risk, because the changes had already been flighted on a portion of the production infrastructure for several weeks.

Alas, that was not the case, “because the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends,” Zander said. As a result, initiating the change exposed a bug that caused the Azure Blob storage Front-Ends to enter an infinite loop, leaving them unable to service requests.

To prevent this kind of cock-up from happening again, Microsoft has since updated its deployment system to enforce its flighting policies for all standard updates, be it code or configuration.
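The fix Microsoft describes amounts to moving the policy check out of human hands and into the deployment tool itself, so that a configuration change can no longer skip flighting on an engineer's judgment call. A minimal sketch of that enforcement idea, with all names invented for illustration:

```python
# Hypothetical sketch of tooling-enforced flighting policy: the
# deployment system rejects any standard update, code or configuration,
# that has not gone through the incremental flighting process.

POLICY_REQUIRES_FLIGHTING = {"software", "configuration"}

def submit_deployment(update_type, flighted):
    """Accept a deployment only if it complies with the flighting policy."""
    if update_type in POLICY_REQUIRES_FLIGHTING and not flighted:
        raise ValueError(f"{update_type} updates must be flighted")
    return "accepted"

# A configuration change that skipped flighting is now rejected outright,
# rather than relying on the engineer to follow protocol.
```

The design choice here mirrors the lesson of the incident: any gap "that relied on human decisions and protocol" is closed by making the tooling itself the gatekeeper.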

photo credit: o.tacke via photopin cc

