UPDATED 14:57 EST / NOVEMBER 20 2024

Hasan Siraj, head of software products, ecosystem, at Broadcom talks to theCUBE about AI and clustered systems at SC24.

From servers to clusters: theCUBE takes a look at how Broadcom is using AI to reshape IT infrastructure

Clustered systems are emerging as the backbone of artificial intelligence infrastructure, transforming how industries manage the immense demands of training large language models and executing machine learning tasks.

As traditional computing frameworks fall short, these interconnected networks of servers are driving a new era of efficiency and scalability. However, the success of these systems hinges on strong networking solutions, which serve as the critical foundation for ensuring seamless operations and unlocking AI’s full potential. Without such innovations, even the most advanced AI systems risk falling short of their promise, according to Hasan Siraj (pictured), head of software products, ecosystem, at Broadcom Inc.

Broadcom’s Hasan Siraj talks to theCUBE about AI and clustered systems.

“There [are] people who know how to manage Ethernet-based networks,” Siraj said. “There are troubleshooting tools, monitoring tools that are available. Whenever you’re building an AI network, you have a front end, a back end, a storage and an out-of-band management network that’s all Ethernet. It’s a standard way of managing all of it.”

Siraj spoke with theCUBE Research’s John Furrier and Dave Vellante at SC24, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed how Broadcom is using clustered systems and advanced networking solutions to change AI infrastructure to meet the growing demands of training large language models and executing machine learning tasks. (* Disclosure below.)

AI-driven networking as the backbone of clustered systems

Broadcom is playing a leading role in enabling next-generation clustered systems, particularly for hyperscalers building massive AI-driven infrastructures. Networking, though often overlooked, is pivotal in connecting thousands — potentially millions — of GPUs across clusters, according to Siraj.

“If you are training a large model and these models are growing at an exponential [rate], they don’t fit in a CPU, and a core of a CPU, virtualization is no play,” he explained. “This is why you cannot fit a model within a server or two servers or four servers. That is why you need a cluster. When you have a cluster and everything is spread out, you need glue to put this all together. That is networking.”
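The arithmetic behind that point can be sketched in a few lines. The figures below are illustrative assumptions, not numbers from the interview: a hypothetical model with 1 trillion parameters stored at 2 bytes each, and a hypothetical accelerator with 80 GB of memory. The sketch counts only the weights, so real training, which also needs room for gradients, optimizer state and activations, would require more devices.

```python
def min_devices_for_weights(num_params: int, bytes_per_param: int,
                            device_memory_bytes: int) -> int:
    """Smallest number of devices whose combined memory holds the weights alone."""
    weight_bytes = num_params * bytes_per_param
    # Integer ceiling division, so a partial device counts as a whole one.
    return (weight_bytes + device_memory_bytes - 1) // device_memory_bytes


# Hypothetical figures (assumptions, not from the article):
# 1 trillion parameters, 2 bytes per parameter, 80 GB per accelerator.
devices = min_devices_for_weights(10**12, 2, 80 * 10**9)
print(devices)  # 25
```

Even under these generous assumptions the weights alone span more than two dozen devices, which is why the traffic between them ends up on the network.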

Networking in this context is not merely a cost center but the linchpin of operational success. If the network falters — whether through latency, bandwidth limitations or failure to recover from hardware faults — the entire AI training process can be compromised, leading to costly inefficiencies and delays, Siraj said.

“Ethernet is now becoming kind of the de facto standard for the scale-out network,” he explained. “There are the largest clusters out there, a hundred thousand GPUs, which are based on this meta. Predominantly the world is moving there. You need to be able to cater to very small packets at very high throughput. You need to have what we call link-level retries. You need to be able to optimize the headers to utilize the bandwidth effectively.”
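Why header size matters for small packets can be shown with a rough calculation. The sketch below assumes a RoCEv2-style packet (RDMA carried over UDP and IPv4 on untagged Ethernet) and uses the standard field sizes for that layout. It is an illustration of the general problem, not a description of Broadcom's header optimizations, and the function name is made up for the example.

```python
# Per-packet overhead in bytes for a RoCEv2-style packet on untagged Ethernet.
# The packet layout is an assumption chosen for illustration.
OVERHEAD_BYTES = {
    "preamble_and_sfd": 8,
    "ethernet_header": 14,
    "ipv4_header": 20,
    "udp_header": 8,
    "base_transport_header": 12,
    "invariant_crc": 4,
    "ethernet_fcs": 4,
    "interframe_gap": 12,
}
TOTAL_OVERHEAD = sum(OVERHEAD_BYTES.values())  # 82 bytes


def wire_efficiency(payload_bytes: int, overhead_bytes: int = TOTAL_OVERHEAD) -> float:
    """Fraction of the bytes on the wire that carry application payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


for payload in (64, 256, 4096):
    print(f"{payload:>5} B payload: {wire_efficiency(payload):.1%} of link bandwidth")
```

Under these assumptions a 64-byte payload uses well under half of the link for useful data, while a 4,096-byte payload uses about 98%, so trimming header bytes pays off most when packets are small.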

Broadcom is innovating across multiple levels of the technology stack, from network silicon and systems to software optimizations. These developments ensure that AI clusters can keep pace with the exponential growth in data and computational requirements, Siraj added.

“You’ve got to innovate at all levels of the stack,” he said. “We build up network silicon, we’ve got to do a lot of innovation there. There’s innovation on the system level, the innovation that needs to happen on the software side.”

The AI revolution demands a rethink of traditional IT architectures, with networking emerging as the keystone for enabling cutting-edge clustered systems. As companies such as Broadcom push the boundaries of what networking can achieve, the promise of democratizing supercomputing capabilities for AI is becoming a tangible reality, driving advancements that will define the future of technology.

“The [Remote Direct Memory Access] was useful in InfiniBand, but if you want to go to this bigger scale you need to be able to fix it,” Siraj said. “You need to bring multipathing, to your point, out-of-order placement [and] selective retransmit. People are standardizing on these kind of implementations. People have a standard implementation which can allow them to scale down the road no matter what size clusters they’ve gotten.”
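The benefit of retransmitting selectively can be seen by comparing it with go-back-N recovery, a simpler scheme used here as a baseline. The sketch below is a single-round simplification with made-up sequence numbers and function names; it is not how any particular network adapter or Broadcom product implements recovery.

```python
def go_back_n_resends(window: list[int], lost: set[int]) -> list[int]:
    """Go-back-N: resend the first lost packet and every packet sent after it."""
    if not lost:
        return []
    first_lost = min(lost)
    return [seq for seq in window if seq >= first_lost]


def selective_resends(window: list[int], lost: set[int]) -> list[int]:
    """Selective retransmit: resend only the packets that were actually lost."""
    return [seq for seq in window if seq in lost]


# Hypothetical window of 16 packets with two losses.
window = list(range(16))
lost = {3, 9}
print(len(go_back_n_resends(window, lost)))  # 13
print(len(selective_resends(window, lost)))  # 2
```

In this example go-back-N resends 13 packets to repair two losses. Selective retransmit resends only the two, but the receiver must then accept packets that arrive out of order, which is the same capability that spreading traffic across multiple paths depends on.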

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE Research’s coverage of SC24:

(* Disclosure: TheCUBE is a paid media partner for SC24. Neither Dell Technologies Inc. nor WekaIO Inc., the premier sponsors of theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
