UPDATED 14:23 EST / FEBRUARY 11 2014

Microsoft on: Designing cloud infrastructure for 1m+ server scale | #OCPSummit

by Valentina Craft

Day one of the Open Compute Project Summit V brought tech athletes from iconic companies together on stage, where they challenged each other to deliver breaking news, breakthrough products and powerful presentations highlighting industry trends.

Kushagra Vaid, General Manager of Cloud Server Engineering with Microsoft, shared with the audience the secrets of “Designing Cloud Infrastructure for 1m+ Server Scale.” He started with the news that Microsoft is finally joining OCP and contributing its cloud server design specifications to Open Compute.

Vaid outlined the principles behind the design and the key considerations in designing an infrastructure of a million servers.

“One of the key considerations in designing our cloud infrastructure was to address the challenge of designing a common architecture that can take into account the requirements of a diverse set of applications,” he said.

For example, Bing is a very compute-intensive workload that is heavy on the CPU, Outlook.com is very storage-heavy, and some other applications are network-heavy.

In designing this infrastructure, the first challenge was achieving a modular and flexible architecture, and the second was accommodating the data centers.

Microsoft has megascale data centers (60 megawatts or larger in one facility) scattered across the globe. Deployments across the globe therefore require a supply chain to match.

Vaid detailed the implications of that server scale, focusing on design, supply chain and operations.

“On the design, when you deploy over a million servers you have to keep in mind that you cannot have a large number of SKUs. At this level the hardware reliability becomes less important, and the software is the key entity providing reliability for the application. That allows for a lot of simplifications in the design of the hardware itself.”

Talking about the server scale implications for the supply chain, Vaid commented: “For a big deployment it is very inefficient to deploy in small chunks; in our case it can be around 10,000 servers in one chunk. The ecosystem engagement model for how do you source and manufacture components, how do you integrate, assemble and deliver components is very different as well: direct negotiations with component suppliers, consigning components to the manufacturers, and the manufacturing material also becomes a factor to consider,” he explained.

“The last and most important aspect is how you operate a fleet of over a million servers. The operations model needs to scale and you also have to keep in mind that they are deployed globally. In contrast with a traditional IT enterprise environment, when a server goes down, you don’t have to have a 24/7 staff to replace it. In Microsoft’s case it can take from one week to two weeks until a server is replaced or serviced. It’s a very low touch model, and the software that is operating in the infrastructure takes care of ensuring that the workload can be shifted to a different part of infrastructure if a part of a server goes down. The service availability is met and the end-user doesn’t see any impact.”

“The model is particularly efficient when operating a large server fleet,” summarized Vaid.

Infrastructure-wise, it really comes down to three things, explained Vaid:

Standardization & Modularization
Design Simplicity
Operations Excellence

Some of the key features are:

Shared infrastructure for efficiency and TCO optimization
Blind-mated signal connectivity
Network and storage cabling via backplane architecture
Secure & scalable systems management

As part of joining OCP, Microsoft is contributing four main things:

Source Code (Chassis management source code through Open Source)
Specifications (Chassis, Blade, Chassis Manager, Mezzanines, Management APIs)
Mechanical CAD Models (Chassis, Blade, Chassis Manager, Mezzanines)
Board Files & Gerbers (Chassis Manager, Tray Backplane, Power Distribution Backplane)

As for Microsoft’s goals for the future, they include modularity with feature simplicity, power efficiency, technology innovation, low-cost mass production, scalable systems management, HW and SW security, reduced operator errors, diagnostics and self-healing, and environmental sustainability.

The entire presentation can be viewed here.
