Nvidia aims to boost Blackwell GPUs by donating platform design to the Open Compute Project
Nvidia Corp. today said it has contributed parts of its Blackwell accelerated computing platform design to the Open Compute Project and broadened support for OCP standards in its Spectrum-X networking fabric.
Nvidia hopes the move will help solidify its new line of Blackwell graphics processing units, which are now in production, as a standard for artificial intelligence and high-performance computing.
In a separate announcement at the OCP Global Summit, Arm Holdings plc announced a collaboration with Samsung Electronics Co. Ltd.'s Foundry, ADTechnology Co. and South Korean chip startup Rebellions Inc. to develop an AI CPU chiplet platform targeted at cloud, HPC and AI/machine learning training and inferencing.
The elements of the GB200 NVL72 system electro-mechanical design (pictured) that Nvidia will share with OCP include the rack architecture, compute and switch tray mechanicals, liquid-cooling and thermal environment specifications and NVLink cable cartridge volumetrics. NVLink is a high-speed interconnect technology Nvidia developed to enable faster communication between GPUs.
The GB200 NVL72 is a liquid-cooled appliance that ships with 36 GB200 accelerators and 72 Blackwell GPUs. The NVLink domain connects them into a single massive GPU that can provide 130 terabytes per second of low-latency communication.
Built for AI
The GB200 Grace Blackwell Superchip connects two Blackwell Tensor Core GPUs with an Nvidia Grace CPU. The company said the rack-scale machine can perform large language model inference 30 times faster than the predecessor H100 Tensor Core GPU while being 25 times more energy-efficient.
Nvidia has contributed to OCP for more than a decade, including its 2022 submission of the HGX H100 baseboard design, which is now a de facto standard for AI servers, and its 2023 donation of the ConnectX-7 network interface card design, which became the basis of the OCP Network Interface Card 3.0 specification.
Spectrum-X is an Ethernet networking platform built for AI workloads, particularly in data center environments. It combines Nvidia Spectrum-4 Ethernet switches with its BlueField-3 data processing units to deliver a low-latency, high-throughput and efficient networking architecture. Nvidia said it remains committed to offering customers an InfiniBand option.
The platform will now support OCP’s Switch Abstraction Interface and Software for Open Networking in the Cloud standards. The Switch Abstraction Interface standardizes how network operating systems interact with network switch hardware. SONiC is a hardware-independent network software layer that is aimed at cloud infrastructure operators, data centers and network administrators.
Nvidia said customers can use Spectrum-X’s adaptive routing and telemetry-based congestion control to accelerate Ethernet performance for scale-out AI infrastructure. ConnectX-8 SuperNIC network interface cards for OCP 3.0 will be available next year, enabling organizations to build more flexible networks.
Taming complexity
“In the last five years, we’ve seen a better than 20,000-fold increase in the complexity of AI models,” said Shar Narasimhan, director of product marketing for Nvidia data center GPUs. “We’re also using richer and larger data sets.” Nvidia has responded with a system design that shards, or fragments, models across clusters of GPUs linked with a high-speed interconnect so that all processors function as a single GPU.
In the GB200 NVL72, each GPU has direct access to every other GPU over a 1.8 terabytes-per-second interconnect. “This enables all of these GPUs to work as a single unified GPU,” Narasimhan said.
Previously, the maximum number of GPUs connected in a single NVLink domain was eight on an HGX H200 baseboard, with a per-GPU communication speed of 900 gigabytes per second. The GB200 NVL72 increases capacity to 72 Blackwell GPUs communicating at 1.8 terabytes per second each, which Nvidia says is 36 times faster than the previous high-end Ethernet standard.
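As a sanity check, the per-GPU and aggregate figures quoted in the article are mutually consistent: 72 GPUs at 1.8 terabytes per second apiece works out to roughly the 130 TB/s of total NVLink bandwidth cited for the NVL72 domain, and double the prior generation's per-GPU rate. A minimal back-of-the-envelope sketch (the constants below come straight from the figures in this article):

```python
# Bandwidth figures quoted in the article
GPUS_PER_DOMAIN = 72      # Blackwell GPUs in one GB200 NVL72 NVLink domain
PER_GPU_TBPS = 1.8        # NVLink bandwidth per Blackwell GPU, TB/s
PRIOR_GEN_TBPS = 0.9      # per-GPU NVLink bandwidth on the HGX baseboard, TB/s

# Aggregate bandwidth across the whole NVLink domain
aggregate_tbps = GPUS_PER_DOMAIN * PER_GPU_TBPS

# Generational per-GPU speedup
speedup = PER_GPU_TBPS / PRIOR_GEN_TBPS

print(f"Aggregate NVLink bandwidth: {aggregate_tbps:.1f} TB/s")  # ~129.6, i.e. the ~130 TB/s quoted
print(f"Per-GPU speedup over prior generation: {speedup:.0f}x")  # 2x
```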
“One of the key elements was using NVSwitch to keep all of the servers and the compute GPUs close together so we could mount them into a single rack,” Narasimhan said. “That allowed us to use copper cabling for NVLink to lower costs and use far less power than fiber optics.”
Nvidia added 100 pounds of steel reinforcement to the rack to accommodate the dense infrastructure and developed quick-release plumbing and cabling. The NVLink spine was reinforced to hold up to 5,000 copper cables and deliver 120 kilowatts of power, more than double the load of current rack designs.
“We’re contributing the entire rack with all of the innovation that we performed to reinforce the rack itself, upgrade the NVLink, liquid-cooling and plumbing quick-disconnect innovations, as well as the manifolds that sit on top of the compute trays and switch trays to deliver direct liquid cooling to each individual tray,” Narasimhan said.
The Arm-led project will combine Rebellions’ Rebel AI accelerator with ADTechnology’s Neoverse CSS V3-powered compute chiplet, implemented on Samsung Foundry’s two-nanometer gate-all-around process technology. The companies said the chiplet will deliver two to three times the performance and power efficiency of competing architectures when running generative AI workloads. Rebellions earlier this year raised $124 million to fund its engineering efforts as it takes on Nvidia in AI processing.
Photo: Nvidia