Microsoft has just published a blow-by-blow account of what went wrong during last month’s Azure cloud storage service outage, which caused thousands of websites, including its own Windows Store and MSN.com, to go offline.
The Microsoft Azure service interruption on Nov. 18 resulted in intermittent connectivity issues with the Azure Storage service in multiple regions, and it was all due to a simple human error.
Jason Zander, Microsoft’s Vice President for Azure, admitted as much in a blog post, saying “there was a gap in the deployment tooling that relied on human decisions and protocol”.
At the time of the outage, Microsoft had said the problem was caused by “… an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting (testing).”
Zander explains further:
“There are two types of Azure Storage deployments: software deployments (i.e. publishing code) and configuration deployments (i.e. change settings). Both software and configuration deployments require multiple stages of validation and are incrementally deployed to the Azure infrastructure in small batches. This progressive deployment approach is called ‘flighting.’”
“When flights are in progress, we closely monitor health checks. As continued usage and testing demonstrates successful results, we will deploy the change to additional slices across the Azure Storage infrastructure.”
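The flighting process Zander describes can be sketched in a few lines of Python. This is a hypothetical illustration, not Microsoft's tooling: the slice names and `healthy()` check are stand-ins for Azure's real monitoring and deployment infrastructure.

```python
def healthy(slice_name):
    """Stand-in health check; a real system would query live monitoring."""
    return True

def flight(change, slices):
    """Deploy `change` one slice at a time, halting on the first failed health check."""
    deployed = []
    for s in slices:
        # Placeholder for actually pushing the code or configuration change.
        deployed.append(s)
        if not healthy(s):
            # Stop the flight before the change spreads any further.
            raise RuntimeError(f"health check failed on {s}; halting flight")
    return deployed

# Usage: the change reaches a small slice first and widens only on success.
flight("perf-update", ["test-slice", "region-slice", "global-slice"])
```

The key property is that a bad change can only ever damage the slices it has already reached, which is exactly the protection that was lost when the policy was skipped.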
Now, Microsoft says its analysis of the outage shows that “faulty flighting” was what caused the problem. It began with a software change Microsoft’s engineers made to improve Azure Storage’s performance by reducing the CPU footprint of the Azure Storage Table Front-Ends.
Unfortunately for Microsoft, someone forgot to follow protocol. “The standard flighting deployment policy of incrementally deploying changes across small slices was not followed,” said Zander. The engineer performing the upgrade apparently thought that doing so was low risk, because the changes had already been flighted on a portion of the production infrastructure for several weeks.
Alas, that was not the case, “because the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends,” Zander said. As a result, when the change was initiated, it exposed a bug that caused the Azure Blob storage Front-Ends to enter an infinite loop, meaning they were unable to service requests.
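The failure mode is worth spelling out: a configuration switch validated for one kind of front-end also flipped a code path on another kind, where it had never been tested. The sketch below is purely illustrative (not Microsoft's code); the loop is bounded by a counter so the example actually terminates, where the real bug spun forever.

```python
def serve_request(frontend_kind, low_cpu_mode, max_iterations=1000):
    """Toy model of a front-end handling one request under a feature flag."""
    if low_cpu_mode and frontend_kind == "blob":
        # Latent bug: this path was flighted only for "table" front-ends,
        # yet the shared switch enabled it for "blob" front-ends too.
        i = 0
        while True:  # the real bug looped indefinitely; we bound it here
            i += 1
            if i >= max_iterations:
                return "stuck: request never serviced"
    return "request serviced"

serve_request("table", low_cpu_mode=True)  # the tested path works
serve_request("blob", low_cpu_mode=True)   # the untested path spins
```

Because the flag was shared across both front-end types, no amount of testing on the Table side could have caught the Blob-side loop.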
To prevent this kind of cock-up from happening again, Microsoft has since updated its deployment system to enforce its flighting policies for all standard updates, be it code or configuration.
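Moving the policy from human judgment into the tooling can be sketched as a simple gate: the deployment system refuses any plan, code or configuration, that skips the incremental stages. The stage names below are illustrative assumptions, not Azure's actual rollout rings.

```python
# Hypothetical required order of flighting stages (illustrative names).
REQUIRED_STAGES = ["test-slice", "single-region", "multi-region", "global"]

def validate_rollout(planned_stages):
    """Reject any deployment plan that skips or reorders the flighting stages."""
    if planned_stages != REQUIRED_STAGES[:len(planned_stages)]:
        raise ValueError(f"flighting policy violation: {planned_stages}")
    return True

validate_rollout(["test-slice", "single-region"])  # a compliant prefix passes

try:
    validate_rollout(["global"])  # pushing everywhere at once is rejected
except ValueError:
    pass  # the tooling blocks the deployment instead of trusting the operator
```

The point of enforcing this in software is that a "low risk" judgment call like the one Zander describes can no longer bypass the policy.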
photo credit: o.tacke via photopin cc