Microsoft dealt with some pretty significant outages of its Office 365 services this week. Customers and partners are reacting: some are outraged, and others are disappointed that they had subscribed to a service that was supposedly more stable and secure than their own infrastructure. The first of the two incidents happened on November 8, when customers experienced significant delays in email delivery. The second happened on November 13, when email access was intermittent throughout the day.
In an email to Microsoft Office 365 customers, the company reported the following:
The Office 365 team strives to provide exceptional service to all of our customers. On Thursday, November 8 and Tuesday, November 13 we experienced two separate service issues that impacted customers served from our data centers in the Americas. We apologize for the inconvenience these issues caused you and your employees.
We are committed to communicating with our customers in an open and honest manner about service issues and the steps we’re taking to prevent recurrences…
• The first service incident occurred on November 8 and resulted in prolonged mail flow delays for many of our customers in North and South America. Office 365 utilizes multiple anti-virus engines to identify and clean virus messages from our customers’ inboxes. Going forward, we have built and implemented better recovery tools that allow us to remediate these situations much faster, and we are also adding some additional architectural safeguards that automatically remediate issues of this general nature.
• On November 13, some customers in North and South America were unable to access email services. This service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service. These three issues in combination caused customer access to email services to be degraded for an extended period of time. Significant capacity increases are already underway and we are also adding automated handling on these type of failures to speed recovery time.
As we review this postmortem, it is clear that these types of issues can have widespread effects, as customers experienced this week. Note also that we have seen high-impact incidents in the past, with outages across many different cloud-based services. The point here is that Microsoft is taking these incidents, analyzing them, and aiming to improve:
Across the organization, we are executing a full review of our processes to proactively identify further actions needed to avoid these situations.
The lessons that come out of this could take the form of changes to many operational and technical processes. Back in May 2011, when the service was known as BPOS, a large outage occurred, and Microsoft's follow-up statement said the company had updated its communications procedures to be more extensive and timely. Many of the complaints this time around centered on the availability of communications during the outages. Outages happen, and they are painful when unexpected, yet insiders have told me the Office 365 environment is actually extraordinarily resistant to service-impacting incidents.
Rejoice! – Service Credits
Note that outages of this scope and magnitude are increasingly rare, and that uptime is covered by Microsoft’s money-backed SLA guarantee. In fact, the company has already gone beyond its commitments in the form of a service credit:
As a gesture of our commitment to ensuring the highest quality service experience Microsoft is changing the standard credit procedure for this incident and is proactively providing your organization a credit equal to 25 percent of your monthly invoice.
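To put the credit and the uptime guarantee in concrete terms, here is a rough sketch of the arithmetic. The 25 percent rate comes from Microsoft's statement above; the invoice amount, the downtime figure, and the 30-day month are hypothetical values chosen purely for illustration, not numbers reported for these incidents.

```python
def monthly_uptime_percent(downtime_minutes, days_in_month=30):
    """Return uptime for the month as a percentage of total minutes."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def service_credit(monthly_invoice, credit_rate=0.25):
    """Return the credit owed, given an invoice and a credit rate.

    The default rate of 0.25 is the 25 percent quoted in Microsoft's email.
    """
    return round(monthly_invoice * credit_rate, 2)


if __name__ == "__main__":
    # Hypothetical inputs: 8 hours of degraded service, a $1,000 invoice.
    downtime = 8 * 60
    print(f"Uptime this month: {monthly_uptime_percent(downtime):.2f}%")
    print(f"Credit on a $1,000 invoice: ${service_credit(1000.0):.2f}")
```

Swap in your own invoice total and whatever downtime your organization actually saw to get a feel for how the credit compares with the disruption.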
By the way, that is a money-backed guarantee that you will not find with Google Apps, a competitor in the cloud email space. For those who point to independent hosted cloud email providers as an alternative to Microsoft, the fact of the matter is that those products are generally not comparable at all: they don’t typically include Office, SharePoint, and Lync together, and at least one of those products is usually missing from the list.
Don’t Turn Away from Office 365 (or cloud services)
Rest assured, Microsoft is addressing these issues, as it has done in the past, and it is learning from them so they will not happen again. This kind of process improvement has a collateral effect, spilling over to other potential scenarios and improving the product's reliability overall. If people start running for the exits on Office 365 because of a couple of incidents that have been rectified, and financially compensated on top of that, then they are missing out on all the product has to offer, from improved business resource utilization to worker productivity to ease of administration. In the big picture, while the effects were widespread, the outages over the course of the year amount to a mere blip on the radar, if that. There aren’t many organizations running their own infrastructure with that kind of track record. Bottom line: Office 365 still offers organizations a significant advantage and, like the other cloud services available in the industry, carries no more strategic risk than an in-house operation.