UPDATED 20:06 EST / JULY 15 2019

Microsoft reveals how it’s planning to make its Azure cloud even more reliable

Microsoft Corp. says the current 99.995% average uptime of its Azure public cloud infrastructure offering simply isn’t good enough, so it’s taking steps to improve it even more.
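To put that figure in perspective, an uptime percentage can be converted into expected downtime per year. The snippet below is only an illustrative calculation, not anything Microsoft published. The function name `downtime_minutes_per_year` and the 365-day year are assumptions made for the example.

```python
# Illustrative arithmetic only: convert an uptime percentage into
# expected downtime per year. Assumes a 365-day year (no leap day).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year implied by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.995, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime -> {minutes:.1f} minutes of downtime per year")
```

By this arithmetic, 99.995% uptime corresponds to roughly 26 minutes of downtime a year, averaged across the service.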

In a blog post today, Chief Technology Officer Mark Russinovich noted how Azure’s availability was hurt by “three unique and significant incidents” in the last 12 months.

Those included a major data center outage in its South Central U.S. region in September 2018 and Azure Active Directory Multi-Factor Authentication problems in November. Then there were domain name system maintenance issues in May, which led to further outages for some customers.

Those and other incidents simply won’t do, Russinovich said. In response, the company has created what it calls a “Quality Engineering” team reporting to him that will work alongside its existing Site Reliability Engineering team to come up with ways to beef up Azure’s durability.

The team has already begun a number of initiatives to improve Azure's resiliency. For example, the company plans to add availability zones by 2021 to the 10 largest Azure regions that don’t currently have them. The 10 biggest Azure regions already have availability zones, which help guard against data center-level failures, Russinovich said. Each zone is located within an Azure region and has its own independent power source, network and cooling infrastructure.

The company is also expanding its safe deployment practice framework, which requires all code and configuration changes in Azure to pass a set of stringent tests before rolling out to different regions. The framework will be expanded to include all software-defined infrastructure changes in Azure, including alterations to its networking and DNS infrastructure.

Microsoft is also launching in preview the ability for customers to initiate their own failovers at the storage level, as a direct result of the September 2018 data center outage in the South Central U.S. region. Failover refers to a method used to protect computer systems from failure, in which standby equipment automatically takes over when the main system fails.

“Because it is our policy to prioritize data retention over time-to-restore, we chose to endure a longer outage to ensure that we could restore all customer data successfully,” Russinovich said. “A number of you have told us that you want more flexibility to make this decision for your own organizations, so we are empowering customers by previewing the ability to initiate your own failover at the storage-account level.”
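The difference between automatic and customer-initiated failover can be sketched in a few lines. This is a toy model and does not represent Azure's storage API. The names `Endpoint`, `FailoverPair`, `check` and `initiate_failover` are hypothetical and exist only for illustration.

```python
# Toy model of failover; every name here is hypothetical, not an Azure API.
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    healthy: bool = True


class FailoverPair:
    """A primary endpoint with a standby that can take over."""

    def __init__(self, primary: Endpoint, standby: Endpoint) -> None:
        self.primary = primary
        self.standby = standby
        self.active = primary

    def check(self) -> Endpoint:
        """Automatic failover: switch to the standby only if the primary has failed."""
        if (
            self.active is self.primary
            and not self.primary.healthy
            and self.standby.healthy
        ):
            self.active = self.standby
        return self.active

    def initiate_failover(self) -> Endpoint:
        """Manual failover: the operator decides to switch, whatever the primary's state."""
        if not self.standby.healthy:
            raise RuntimeError("standby is not healthy; refusing to fail over")
        self.active = self.standby
        return self.active


if __name__ == "__main__":
    pair = FailoverPair(Endpoint("primary-region"), Endpoint("secondary-region"))
    print(pair.check().name)              # primary is healthy, so it stays active
    print(pair.initiate_failover().name)  # operator chooses to switch anyway
```

In the manual case the trade-off Russinovich describes, restoring service sooner versus waiting to recover all data, is left to whoever calls `initiate_failover`.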

The CTO also discussed Microsoft’s Project Tardigrade, which is an upcoming service intended to detect hardware failures and memory leaks before they happen and freeze affected virtual machines so they can be moved to a different host.

“Continuous, real-time improvement is one of the great advantages of cloud services, and while we will never eliminate all such risks, we are deeply focused on reducing both the frequency and the impact of service issues while being transparent with our customers, partners, and the broader industry,” Russinovich said.

Constellation Research Inc. analyst Holger Mueller said it was good to see Microsoft adding more processes and best practices to make Azure more resilient, as reliability is one of the most important value propositions for cloud computing.

“The most important update is the expansion of its availability zones, as this is one area where Microsoft actually trails other cloud providers,” Mueller said.

Image: bsdrouin/Pixabay
