Microsoft on: Designing cloud infrastructure for 1m+ server scale | #OCPSummit
Day one of the Open Compute Project Summit V brought tech athletes from iconic companies together on stage, where they challenged each other with breaking news, breakthrough products and powerful presentations that highlighted industry trends.
Kushagra Vaid, General Manager of Cloud Server Engineering with Microsoft, shared with the audience the secrets of “Designing Cloud Infrastructure for 1m+ Server Scale.” He started by sharing the news that Microsoft is finally joining OCP and contributing its cloud server design specifications to the Open Compute Project.
Vaid went through some of the principles behind the design and the key features of an infrastructure built for a million servers.
“One of the key considerations in designing our cloud infrastructure was to address the challenge of designing a common architecture that can take into account the requirements of a diverse set of applications,” he said.
For example, Bing is a very compute-intensive workload that is heavy on the CPU, Outlook.com is very storage-heavy, and some other applications are network-heavy.
When designing this infrastructure, the first challenge was having a modular and flexible architecture and the second challenge was addressing the data centers.
Microsoft has megascale data centers (60 megawatts or larger in one facility) scattered across the globe, so deploying to them requires a supply chain with global reach.
Vaid detailed the implications of this server scale, focusing on design, supply chain and operations.
“On the design, when you deploy over a million servers you have to keep in mind that you cannot have a large number of SKUs. At this level the hardware reliability becomes less important, and the software is the key entity providing reliability for the application. That allows for a lot of simplifications in the design of the hardware itself.”
Talking about the server scale implications in the supply chain, Vaid commented: “For a big deployment it is very inefficient to deploy in small chunks; in our case it can be around 10,000 servers in one chunk. The ecosystem engagement model for how do you source and manufacture components, how do you integrate, assemble and deliver components is very different as well: direct negotiations with component suppliers, consigning components to the manufacturers, and the manufacturing material also becomes a factor to consider,” he explained.
“The last and most important aspect is how you operate a fleet of over a million servers. The operations model needs to scale and you also have to keep in mind that they are deployed globally. In contrast with a traditional IT enterprise environment, when a server goes down, you don’t have to have a 24/7 staff to replace it. In Microsoft’s case it can take from one week to two weeks until a server is replaced or serviced. It’s a very low touch model, and the software that is operating in the infrastructure takes care of ensuring that the workload can be shifted to a different part of infrastructure if a part of a server goes down. The service availability is met and the end-user doesn’t see any impact.”
“The model is particularly efficient when operating a large server fleet,” summarized Vaid.
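The low-touch model Vaid describes can be illustrated with a minimal sketch. This is not Microsoft's software: the `Server` class, the `handle_failure` function, the repair queue and the "least-loaded healthy server" placement rule are all hypothetical, chosen only to show the idea of software moving workloads off a failed machine while the hardware waits for a later service visit.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical model of one machine in the fleet."""
    name: str
    healthy: bool = True
    workloads: list = field(default_factory=list)


def handle_failure(fleet, failed_name, repair_queue):
    """Mark a server as failed and move its workloads to healthy servers.

    The failed server is only queued for repair; nothing waits on a
    technician. The placement rule (fewest workloads first) is an
    assumption made for this sketch.
    """
    failed = next(s for s in fleet if s.name == failed_name)
    failed.healthy = False
    repair_queue.append(failed.name)  # serviced later, not immediately

    healthy = [s for s in fleet if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy servers left to take the workloads")

    for workload in failed.workloads:
        # Pick the healthy server currently running the fewest workloads.
        target = min(healthy, key=lambda s: len(s.workloads))
        target.workloads.append(workload)
    failed.workloads = []


# Usage: three servers, one of which fails.
fleet = [
    Server("s1", workloads=["bing-a", "bing-b"]),
    Server("s2", workloads=["mail-a"]),
    Server("s3"),
]
repair_queue = []
handle_failure(fleet, "s1", repair_queue)

for server in fleet:
    print(server.name, server.healthy, server.workloads)
print("awaiting repair:", repair_queue)
```

In this sketch the workloads keep running on the remaining servers as soon as the failure is handled, which is why the physical repair can wait a week or two without the end user noticing.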
Infrastructure-wise, it really comes down to three things, explained Vaid:
- Standardization & Modularization
- Design Simplicity
- Operations Excellence
Some of the key features are:
- Shared infrastructure for efficiency and TCO optimization
- Blind-mated signal connectivity
- Network and storage cabling via backplane architecture
- Secure & scalable systems management
As part of joining OCP, Microsoft is contributing four main things:
- Source Code (Chassis management source code through Open Source)
- Specifications (Chassis, Blade, Chassis Manager, Mezzanines, Management APIs)
- Mechanical CAD Models (Chassis, Blade, Chassis Manager, Mezzanines)
- Board Files & Gerbers (Chassis Manager, Tray Backplane, Power Distribution Backplane)
Microsoft’s goals for the future, detailed on a slide in the presentation, include modularity with feature simplicity, power efficiency, technology innovation, low-cost mass production, scalable systems management, hardware and software security, reduced operator errors, diagnostics and self-healing, and environmental sustainability.
The entire presentation can be viewed here.