UPDATED 13:39 EDT / MARCH 27 2025


Nebius’ GTC session highlights best practices for building an AI cloud platform

Two terms are practically ubiquitous in technology discussions these days: artificial intelligence and cloud. At last week’s Nvidia GTC conference in San Jose, many emerging technology companies discussed and demonstrated how they leverage AI and the cloud to deliver innovative products and solutions to their customers.

One interesting case study came from Nebius, a Netherlands-based AI full-stack infrastructure company and one of only a handful of Reference Architecture Nvidia Cloud Partners. Gleb Kholodov, head of foundational services, and Oleg Federov, head of hardware R&D, delivered a presentation titled “From zero to scale: How to build an efficient AI cloud from scratch.”

As the AI era moves beyond the hyperscalers to other organizations, best practices from a company that built an AI cloud are useful. Kholodov and Federov walked through the company’s process of building an AI cloud business from the ground up. Nebius went from concept to a fully operational system running tens of thousands of Nvidia GPUs connected via a 400-gigabit-per-second InfiniBand network in just one year.

Here are some notable points from the session:

Getting started — and a fast setback

Nebius was formed from the break-up of Russian company Yandex. Building AI clouds wasn’t the company’s original intention. “At first, we thought we’d be building sovereign clouds, but then ChatGPT really took off, and we decided to pivot and power this emerging AI gold rush instead,” explained Kholodov. An immediate challenge – the limited license granted under the Yandex break-up deal to the cloud stack the Nebius team had helped build in their previous lives – became a blessing in disguise for Kholodov, Federov and their colleagues.

“We had exactly one year to rebuild the entire platform — in high quality — or shut down,” recalled Kholodov. “It was a chance to change our mindset, rethink our priorities, modernize our tech stack and decide on our values and what we’re optimizing for and really reflect that in our design. The time pressure — pretty immense, I would say — kept us focused and helped us cut down the non-essentials.”

Smaller regions — and more of them

Nebius pivoted from its original plan to “deploy a few fully independent bigger regions complete with three availability zones and tons of services,” said Kholodov, because though that approach worked reasonably well for sovereign clouds, “for AI, it just did not cut it.” In a 180-degree move, Nebius changed its focus to building many smaller regions that would be fully independent in terms of fault tolerance, data residency and the like, but interconnected from a management perspective so clients could manage all resources from a single web console.

Adapting to its new model required Nebius to deploy one region every quarter. “To achieve that, we had to architect our cloud for speed of deployment and operational efficiency,” Kholodov recalled. “Deploying new regions fast is not just about software. The hardware needs to get there first, be installed quickly and serviced efficiently.”

In the AI era, innovation moves faster than ever, and the lessons Nebius learned are ones other organizations can leverage as they look to scale their own AI infrastructure plans.

Four key goals

Federov, the hardware lead, said the team focused on four “really important things”:

  1. Sustainability of server specification: Thermal and power efficiency; fast deployment not of single servers but of several racks, modules and entire data centers; and easy-to-change firmware for any of the components. Federov said the team needed quick fixes for problems such as security threats. “If we needed a new functionality, we implemented it quickly.” He said the team’s love of F1 auto racing inspired its approach: “We thought of maintaining our servers as F1 pit stops, so it should be quick, safe and easy.”
  2. Efficient design: Its server design enabled Nebius to maintain machines “with one hand, because the other one, in the data center, is always occupied by a laptop so you can see the task you are doing.” The statement is obviously hyperbole, but it makes a point about serviceability that data center design has often overlooked.
  3. Optimal flow control: To model airflow inside its servers, Nebius uses computational fluid dynamics software similar to what F1 teams use to understand how air flows around a car. This is how Nebius developed optimizations such as separate air plenums for CPU and GPU cooling, dual- and single-rotor fans for different parts of the server, and its own implementation of proportional-integral-derivative (PID) control algorithms.
  4. Energy efficiency: Nebius designed its servers to be highly energy-efficient. “Our servers require up to 23% less energy on a full load,” said Federov. A bonus of this approach was less noise pollution, which made it easier for engineers to communicate, reducing errors and helping deliver better service-level agreements.
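The fan-control point in item 3 above can be sketched in code. What follows is a minimal, hypothetical PID loop of the general kind Nebius describes; the class name, gains and temperature setpoint are all illustrative assumptions, not Nebius’ actual implementation.

```python
# Hypothetical sketch of a PID (proportional-integral-derivative) fan-control
# loop of the kind Nebius describes. All names and constants are illustrative.

class PIDController:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target temperature in degrees C
        self.integral = 0.0           # accumulated error over time
        self.prev_error = 0.0         # last error, for the derivative term

    def update(self, measured_temp, dt):
        """Return a fan duty cycle in [0.0, 1.0] for the measured temperature."""
        error = measured_temp - self.setpoint       # positive when too hot
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        return min(max(output, 0.0), 1.0)           # clamp to a valid duty cycle
```

With illustrative gains kp=0.05, ki=0.01, kd=0.02 and a 70 °C setpoint, a first reading of 80 °C yields a duty cycle of 0.8; a reading well below the setpoint clamps the fan to 0.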

Building servers like LEGOs

Federov and his team leveraged the concept of LEGOs to assemble large, modern AI server racks based on the Open Compute Project. “We were happy to know that Nvidia GB200 servers use OCP,” he said. “That’s how our vision, even on hardware, is aligned. We build not only servers and racks; we build data centers. This is the only way we can make the most of our hardware optimization, reach the desired efficiency, and incorporate our sustainability principles.”

Adjusting on the fly to succeed in AI

Even though Nebius had to rewrite its cloud plan, the company’s mission remained constant: “Give high-quality AI infrastructure and services to customers of all types and sizes, at an affordable price, and on terms that fit their needs the best, be it reserved, on-demand or spot,” said Kholodov. He said the company’s original stack concept, aimed at the average cloud user, called for a lot of services, managed databases and multiple types of VPNs. But AI required a different approach. “To succeed in this AI market, we needed to focus and slim down the offering,” he explained. The team had to rethink what it was — and what it wasn’t — and “shake off our megalomania of trying to be the only cloud that you ever need and instead aspire to become the best cloud for all your AI needs,” Kholodov said.

Building the cloud

After sorting out the hardware, Nebius needed to build the cloud on top of it — in just a year. To meet that aggressive goal, the company had to reduce complexity: simplifying design choices and infrastructure, avoiding circular dependencies, and being ready to use whatever tools were available on the market to meet the timeline.

“We knew we would be learning as we go,” he said. “We knew that some choices that we make in the beginning, while they’ll definitely be helpful to lift us off of the ground, may not be the right choices that will help us scale. We needed to retain the utmost flexibility by being able to change anything we needed under the hood without impacting the higher levels that customers are exposed to.”

Choosing Kubernetes

While some of its services could operate on top of a hardware-as-a-service design, for the bulk of them, Nebius needed a higher-level platform, so the company decided to go with Kubernetes. “Kubernetes is not exactly your typical go-to choice for building public clouds,” Kholodov said. “It’s primarily for containers. It has some scalability to it, and it’s convenient.” For the data plane, Nebius deployed a virtualization stack with three pillars: virtualization of compute, network and storage.

“Within three months, we were able to hit the ground running,” said Kholodov, “launching our first VM on the freshly installed Kubernetes. And that unlocked all the development on the higher levels. We still had to customize it pretty heavily, but we’re not afraid to do that for any of the components in the stack; we have engineers who can touch every single layer.”

Not finding anything they liked on the market, the Nebius team wrote its own container network interfaces to give its customers the utmost control over their virtual networks. For Nebius customers, compute is just virtual machines, as the complexity of the underlying Kubernetes is masked by the software.
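As an illustration of that masking, here is a hypothetical sketch of how a thin API layer can expose plain virtual machines while translating each request into a Kubernetes object internally. All names, fields and the container image are assumptions for illustration, not Nebius’ actual design; only the `nvidia.com/gpu` resource name follows standard Kubernetes GPU scheduling conventions.

```python
# Illustrative sketch (not Nebius' actual code) of a facade that presents
# "just virtual machines" while generating Kubernetes objects under the hood.

from dataclasses import dataclass

@dataclass
class VMRequest:
    """The customer-facing shape: a VM, with no Kubernetes concepts exposed."""
    name: str
    vcpus: int
    memory_gib: int
    gpus: int = 0

def to_k8s_manifest(req: VMRequest) -> dict:
    """Translate a VM request into a pod-style manifest.

    Customers only ever see VMRequest; the Kubernetes shape below is an
    internal detail the platform is free to change without breaking them."""
    resources = {
        "cpu": str(req.vcpus),
        "memory": f"{req.memory_gib}Gi",
    }
    if req.gpus:
        resources["nvidia.com/gpu"] = str(req.gpus)  # standard GPU resource name
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": req.name, "labels": {"workload-type": "vm"}},
        "spec": {
            "containers": [{
                "name": "vm-runtime",                      # hypothetical VM runtime
                "image": "example.com/vm-runtime:latest",  # hypothetical image
                "resources": {"requests": resources, "limits": resources},
            }],
        },
    }
```

The design choice this sketches is the one Kholodov describes: keeping the customer-visible surface stable so everything beneath it can be swapped out.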

Close working relationship with Nvidia

With just a year to bring its AI cloud to market, Federov said the Nebius team had to be creative and stay focused. “We had to cut the nonessentials, but we did not cut corners,” he said. “Our cloud adheres to Nvidia’s reference architecture, and that was recently acknowledged by Nvidia, which granted us the status of reference platform Nvidia cloud partner” for Nebius’s competencies, including compute, Nvidia AI, networking, visualization and Nvidia virtual desktops.

What the Nebius team learned along the way

“We came from the world of VMs, of selling compute power,” Kholodov said. “In the AI world, especially with AI training, people don’t want to buy just compute. They want to buy compute that directly contributes to the progress of their model training, with the clusters continuing to get bigger. In fact, some of our customers run clusters as big as 4,000 GPUs.”

Federov added that in addition to designing hardware, how it is produced and tested is critical. “It all starts in the factories,” he said. “It’s important to make a small data center right in front of the assembly lines so you can apply special environmental conditions [temperature, humidity, and more] and the latest firmware for all components. We add specialized testing toolsets like Nvidia 3DMark, for example, for GPUs. It’s important to try to mimic client workloads in the factory — how they specifically use the hardware at this particular stage.”

Lessons learned: Change is constant in the AI era

The most important takeaway from the Nebius session is that change does not stop, and it’s important to embrace it. The Nebius team faced many changes over the past year, and its success came from being adaptable and resilient. Kholodov discussed how Nebius initially tried to avoid unpredictability, as we all do, but quickly realized that change and unpredictability need to be baked into its plans and into anything it brings to market.

Information technology executives in charge of AI projects need the same mindset. Everything in the AI ecosystem – hardware, software, policies, people and more – is unpredictable. Embrace it, change with it and be ready for whatever comes – that’s the only path to AI success.

Zeus Kerravala is a principal analyst at ZK Research, a division of Kerravala Consulting. He wrote this article for SiliconANGLE.

Photo: Nvidia
