Microsoft is the latest company trying to repair the reputational damage caused by an outage of its cloud services. Mike Neil, the general manager in charge of Microsoft’s Windows Azure, wrote a blog post explaining the reason for the service interruption that hit customers in Western Europe last week. The outage happened on July 26 and made Azure’s Compute Service unavailable for about two and a half hours.
Neil explained that Microsoft traced the issue to a networking glitch that snowballed into a much larger service disruption. According to Neil,
“The service interruption was triggered by a misconfigured network device that disrupted traffic to one cluster in our West Europe sub-region. Once a set device limit for external connections was reached, it triggered previously unknown issues in another network device within that cluster, which further complicated network management and recovery.”
Although Microsoft has restored the service and knows the networking problem was the catalyst, it has not yet determined the root cause. Neil said Microsoft has assigned a lot of manpower to analyzing the outage and finding its source, and that more details would be posted on the official Azure blog later this week.
The cloud has turned into a huge phenomenon in the last couple of years, and services like AWS and Azure are spearheading it. The two cloud behemoths dominate the IaaS space, with user bases that span individual developers and large enterprises. This popularity is obviously good for business, but it’s also a burden: every outage or misstep is magnified in the public eye.
The 2.5-hour Azure outage affected only a portion of its customers, but that’s still a large number of users, including companies that rely on the service to run their business. Microsoft isn’t the only cloud provider with recent problems. AWS also had an outage earlier this month. An electrical issue at the company’s Northern Virginia availability zone took down a number of major services, including Netflix and Instagram. Not only did the outage affect AWS’ customers, it affected their customers’ customers and led to a number of other issues that companies like Netflix had to resolve after the service was restored.
The cloud has enormous potential, but users have to realize that cloud services can go down just like any other IT component. The cloud doesn’t eliminate the need for SLAs, redundancy or sound architectural practices. The cloud may be great, but it’s not magical.
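To put the numbers in perspective, here is a rough back-of-the-envelope sketch of what a 2.5-hour outage does to a single month's availability. The 99.95% target is purely an illustrative assumption, not a figure taken from Microsoft's post or from any particular SLA, and the function names are made up for this example.

```python
def availability(downtime_hours: float, period_hours: float) -> float:
    """Return the fraction of the period during which the service was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return (period_hours - downtime_hours) / period_hours


def downtime_budget_hours(target: float, period_hours: float) -> float:
    """Return how many hours of downtime a given availability target allows."""
    if not 0 <= target <= 1:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1 - target) * period_hours


HOURS_IN_JULY = 31 * 24   # 744 hours in a 31-day month
OUTAGE_HOURS = 2.5        # outage length reported in the article
ASSUMED_TARGET = 0.9995   # hypothetical monthly target, for illustration only

observed = availability(OUTAGE_HOURS, HOURS_IN_JULY)
budget = downtime_budget_hours(ASSUMED_TARGET, HOURS_IN_JULY)

print(f"Observed availability for the month: {observed:.3%}")
print(f"Downtime allowed by the assumed target: {budget * 60:.1f} minutes")
print(f"Outage exceeded that budget: {OUTAGE_HOURS > budget}")
```

Under these assumptions, one incident of this length consumes several months' worth of downtime budget, which is why redundancy across regions matters more than any single provider's promise.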