

Two words are practically ubiquitous in technology discussions these days: artificial intelligence and cloud. At last week’s Nvidia GTC conference in San Jose, many emerging technology companies discussed and demonstrated how they leverage AI and the cloud to deliver innovative products and solutions to their customers.
One interesting case study came from Nebius, a Netherlands-based AI full-stack infrastructure company, which is one of only a handful of Reference Architecture Nvidia Cloud Partners. Gleb Kholodov, head of foundational services, and Oleg Federov, head of hardware R&D, delivered an interesting presentation titled, “From zero to scale: How to build an efficient AI cloud from scratch.”
As the AI era expands beyond the hyperscalers to other organizations, best practices from a company that built an AI cloud from scratch will be useful. Kholodov and Federov walked through the company’s process of building an AI cloud business from the ground up. Nebius went from concept to a fully operational system running tens of thousands of Nvidia GPUs connected via a 400-gigabits-per-second InfiniBand network in just one year.
Here are some notable points from the session:
Nebius was formed from the break-up of Russian company Yandex. Building AI clouds wasn’t the company’s original intention. “At first, we thought we’d be building sovereign clouds, but then ChatGPT really took off, and we decided to pivot and power this emerging AI gold rush instead,” explained Kholodov. An immediate challenge – the limited license granted under the Yandex break-up deal to the cloud stack the Nebius team had helped build in their previous lives – became a blessing in disguise for Kholodov, Federov and their colleagues.
“We had exactly one year to rebuild the entire platform — in high quality — or shut down,” recalled Kholodov. “It was a chance to change our mindset, rethink our priorities, modernize our tech stack and decide on our values and what we’re optimizing for and really reflect that in our design. The time pressure — pretty immense, I would say — kept us focused and helped us cut down the non-essentials.”
Nebius pivoted from its original plan to “deploy a few fully independent bigger regions complete with three availability zones and tons of services,” said Kholodov, because though that approach worked reasonably well for sovereign clouds, “for AI, it just did not cut it.” In a 180-degree move, Nebius changed its focus to building many smaller regions that would be fully independent in terms of fault tolerance, data residency and the like, but interconnected from a management perspective so clients could manage all resources from a single web console.
Adapting to its new model required Nebius to deploy one region every quarter. “To achieve that, we had to architect our cloud for speed of deployment and operational efficiency,” Kholodov recalled. “Deploying new regions fast is not just about software. The hardware needs to get there first, be installed quickly and serviced efficiently.”
In the AI era, innovation moves faster than ever, and the lessons Nebius learned are something other organizations can leverage as they look to scale their own AI infrastructure plans.
Federov, the hardware lead, said the team focused on four “really important things.”
Federov and his team leveraged the concept of LEGOs to assemble large, modern AI server racks based on the Open Compute Project. “We were happy to know that Nvidia GB200 servers use OCP,” he said. “That’s how our vision, even on hardware, is aligned. We build not only servers and racks; we build data centers. This is the only way we can make the most of our hardware optimization, reach the desired efficiency, and incorporate our sustainability principles.”
Even though Nebius had to rewrite its cloud plan, the company’s mission remained constant: “Give high-quality AI infrastructure and services to customers of all types and sizes, at an affordable price, and on terms that fit their clients’ needs the best, be it reserve, on-demand or spot,” said Kholodov. He said the company’s original stack concept, aimed at the average cloud user, needed a lot of services, managed databases and multiple types of VPNs. But AI required a different approach. “To succeed in this AI market, we needed to focus and slow down the offering,” he explained. The team had to rethink what it was — and what it wasn’t — and “shake off our megalomania of trying to be the only cloud that you ever need and instead aspire to become the best cloud for all your AI needs,” Kholodov said.
After sorting out the hardware, Nebius needed to build the cloud on top of it — in just a year. To meet its aggressive goal, the company had to reduce complexity. That meant simplifying design choices and infrastructure, avoiding circular dependencies and being ready to use whatever tools were available in the market to meet its aggressive timeline.
“We knew we would be learning as we go,” Kholodov said. “We knew that some choices that we make in the beginning, while they’ll definitely be helpful to lift us off of the ground, may not be the right choices that will help us scale. We needed to retain the utmost flexibility by being able to change anything we needed under the hood without impacting the higher levels that customers are exposed to.”
While some of its services could operate on top of a hardware-as-a-service design, for the bulk of them, Nebius needed a higher-level platform, so the company decided to go with Kubernetes. “Kubernetes is not exactly your typical go-to choice for building public clouds,” Kholodov said. “It’s primarily for containers. It has some scalability to it, and it’s convenient.” For the data plane, Nebius deployed a virtualization stack with three pillars: virtualization of compute, network and storage.
“Within three months, we were able to hit the ground running,” said Kholodov, “launching our first VM on the freshly installed Kubernetes. And that unlocked all the development on the higher levels. We still had to customize it pretty heavily, but we’re not afraid to do that for any of the components in the stack; we have engineers who can touch every single layer.”
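Nebius built its own heavily customized stack, but the general pattern of running virtual machines as Kubernetes resources can be seen in open-source projects such as KubeVirt, where a VM is declared as a custom resource and scheduled like any other workload. A minimal sketch (KubeVirt is an illustrative analogue here, not necessarily what Nebius uses):

```yaml
# A KubeVirt-style VirtualMachine manifest: the cluster's VM controller
# boots this machine the same way it schedules a container workload.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            cpu: "2"
            memory: 2Gi
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo
```

Applied with `kubectl apply -f`, a manifest like this lets the container orchestrator serve as the control plane for a VM-based cloud, which is what unlocked the higher-level development Kholodov describes.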
Not finding anything it liked on the market, the Nebius team wrote its own container network interface, or CNI, plugins to give customers the utmost control over their virtual networks. For Nebius customers, compute is just virtual machines; the complexity of the underlying Kubernetes is masked by the software.
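A CNI plugin is wired into the cluster through a small JSON network configuration that the runtime hands to the plugin binary. As a sketch of the standard format defined by the CNI specification (the plugin name `nebius-vpc` is hypothetical; the article doesn’t describe Nebius’s actual configuration):

```json
{
  "cniVersion": "1.0.0",
  "name": "tenant-network",
  "type": "nebius-vpc",
  "ipam": {
    "type": "host-local",
    "subnet": "10.10.0.0/16",
    "gateway": "10.10.0.1"
  }
}
```

Here `host-local` is one of the standard reference IPAM plugins; a custom CNI like the one described would implement the same ADD/DEL/CHECK command contract while programming its own virtual-network data plane underneath.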
With just a year to bring its AI cloud to market, Federov said the Nebius team had to be creative and stay focused. “We had to cut the nonessentials, but we did not cut corners,” he said. “Our cloud adheres to Nvidia’s reference architecture, and that was recently acknowledged by Nvidia, which granted us the status of reference platform Nvidia cloud partner” for Nebius’s competencies, including compute, Nvidia AI, networking, visualization and Nvidia virtual desktops.
“We came from the world of VMs, of selling compute power,” Kholodov said. “In the AI world, especially with AI training, people don’t want to buy just compute. They want to buy compute that directly contributes to the progress of their model training, with the clusters continuing to get bigger. In fact, some of our customers run clusters as big as 4,000 GPUs.”
Federov added that in addition to designing hardware, how it is produced and tested is critical. “It all starts in the factories,” he said. “It’s important to make a small data center right in front of the assembly lines so you can apply special environmental conditions [temperature, humidity, and more] and the latest firmware for all components. We add specialized testing toolsets like Nvidia 3DMark, for example, for GPUs. It’s important to try to mimic client workloads in the factory — how they specifically use the hardware at this particular stage.”
The most important takeaway from the Nebius session is that change doesn’t stop and it’s important to embrace it. The Nebius team faced many changes over the past year, and its success came from its ability to be adaptable and resilient. Kholodov discussed how Nebius initially tried to avoid unpredictability, as we all do, but quickly realized that change and unpredictability need to be baked into its plans and into anything it brings to market.
Information technology executives in charge of AI projects need to have the same mindset. Everything in the AI ecosystem, from hardware and software to policies and people, is unpredictable. Embrace it, change with it and be ready for whatever comes. That’s the only path to AI success.
Zeus Kerravala is a principal analyst at ZK Research, a division of Kerravala Consulting. He wrote this article for SiliconANGLE.