What Happened to Microsoft Azure on Friday: Death by Security Certificate

An expired SSL (Secure Sockets Layer) certificate proved to be the lethal bugaboo for Microsoft’s Azure cloud storage service, pulling the service down on Friday afternoon. It took the software giant less than a day to fix the problem, and Microsoft announced on Saturday that the service had been fully restored.

“Beginning Friday, February 22 at 12:44 PM PST, Storage experienced a worldwide outage impacting HTTPS operations (SSL traffic) due to an expired certificate,” Microsoft revealed on the Windows Azure service dashboard during the outage.

Not only did the outage affect much of the Azure service, but customers also reported problems with the Xbox Music and Video services, which the company itself flagged as potentially affected while the service was being restored.

“Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers,” the company said.

Repairs are complete, but how did a single expired certificate cause a global outage?

This isn’t the first time that Microsoft Azure has suffered an outage that affected a multitude of its customers. Last year, also in February, Windows Azure Management Service went down and the problem spread to Windows Azure Compute. That outage also stemmed from a certificate problem, in that case a date-related issue triggered by Leap Day.

The company, contacted since the outage, has not revealed how the certificate was permitted to expire or how the expiration itself led to the outage.
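Since Microsoft has not said what went wrong, the following is only a generic illustration of how an expiring certificate can be flagged ahead of time, not a description of anything Azure does. It is a minimal Python sketch: the function names, the 30-day warning window, and the sample date are all assumptions made up for the example. The only library call involved is the standard library's ssl.cert_time_to_seconds, which parses the "notAfter" timestamp format that certificates carry.

```python
import ssl
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400.0


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's notAfter timestamp (negative if expired).

    not_after uses the certificate time format, e.g. "Jan 15 00:00:00 2030 GMT".
    now must be a timezone-aware datetime.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / SECONDS_PER_DAY


def needs_renewal(not_after: str, now: datetime, warn_days: float = 30.0) -> bool:
    """True if the certificate is expired or inside the warning window.

    warn_days is a hypothetical threshold chosen for this sketch.
    """
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Hypothetical expiry date, not taken from any real certificate.
    sample_not_after = "Jan 15 00:00:00 2030 GMT"
    today = datetime.now(timezone.utc)
    print("days left:", round(days_until_expiry(sample_not_after, today), 1))
    print("renew now:", needs_renewal(sample_not_after, today))
```

A check along these lines would typically run on a schedule against every certificate in service, so that a renewal gets flagged weeks before the expiry date rather than being discovered when HTTPS traffic starts failing.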

Microsoft isn’t the only cloud provider in the ecosystem to suffer outages lasting hours or even the better part of a day. Amazon Web Services has suffered several massive outages (July 2012, August 2012), as has Google (September 2012, December 2012). These outages raise questions for anyone whose architecture depends on systems such as AWS and Azure: how do you handle the crisis when your primary provider has a massive failure?

Companies such as Reddit and Netflix rely heavily on AWS. As a result, Netflix has been working on open source libraries for crisis recovery should its AWS-based cloud infrastructure get borked, with the aim of making sure customers still get served even if the cloud isn’t serving them. So far, though, nobody has been able to survive a massive outage without a scratch.