As AI demands drive orders-of-magnitude increases in token consumption, the need for scalable, production-grade Kubernetes inference has never been greater.
The challenge now is less about training ever-larger models than about running them reliably, cheaply and at scale. In response, Red Hat Inc. has contributed llm-d, an open-source project for running large language models across Kubernetes clusters, to the Cloud Native Computing Foundation, a leading open-source group, as an early-stage community project. The move suggests that distributed Kubernetes inference is moving from experiment to institution-building. The aim is to bring high-end inference into the operating model enterprise IT teams already use, according to Brian Stevens (pictured, right), senior vice president and AI chief technology officer of Red Hat.
“What we realized is that AI is being developed by data scientists, and as part of that, they’re building their own infrastructure to run it on,” Stevens said. “But the way we thought about it was eventually it’s going to be a CIO’s problem. And what language do CIOs speak these days? They speak KubeCon and Kubernetes and Kubernetes-based platforms. So, the challenge we had is how do we build best-in-class inference that’s scalable, manageable [and] delivers to the [surface-level objective] the end users need — but bring it into a Kubernetes platform?”
Stevens and Robert Shaw (left), director of engineering at Red Hat, spoke with theCUBE’s Rebecca Knight and Rob Strechay at KubeCon + CloudNativeCon EU, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They described llm-d as an effort to make inference on Kubernetes faster, more portable and easier to manage across hardware environments. (* Disclosure below.)
As LLMs move from labs into business systems, inference is becoming an operations problem. The llm-d project is designed to optimize clusters of instances of the vLLM inference engine, not just single nodes. That matters because enterprises do not only want speed, but speed that survives contact with production: capacity planning, uptime, scaling and all the routine burdens of day-two operations, according to Shaw.
“The focus on performance, and really the reason why performance matters so much for LLM systems, is the L stands for large,” Shaw said. “These models are doing an amount of compute that’s hard to fathom, but when I talk to users of llm-d, they’re not only trying to build a state-of-the-art performance system, they’re also trying to do these day-two operations.”
One of the project’s central ideas is disaggregated serving. It separates the prefill and decode stages of inference into distinct, independently scalable pools. In practice, that gives IT teams more precise control over where latency appears and how resources are allocated, Stevens noted.
“Instead, what we did with disaggregation is we split those apart, and now all of a sudden we can independently scale the processing of the input from the production of the tokens,” he said. “There’s a dial on there that an IT [team] can say, ‘Oh, dial up the performance of the input processing, or … we’re not delivering fast enough on the next tokens, dial that up.'”
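The idea Stevens describes can be sketched in a few lines of code. This is a toy illustration only, not llm-d's actual implementation: the class and method names here are invented for the example, and the real project coordinates prefill and decode stages across Kubernetes pods with KV-cache transfer between them. The point the sketch makes is the "dial": each stage lives in its own pool, so an operator can scale prompt processing and token generation independently.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical, independently scalable pool of inference workers."""
    name: str
    replicas: int

    def scale(self, replicas: int) -> None:
        # The "dial" Stevens describes: adjust one stage
        # without touching the other.
        self.replicas = replicas


class DisaggregatedServer:
    """Toy model of disaggregated serving: prefill (input processing)
    and decode (token production) run in separate pools."""

    def __init__(self) -> None:
        self.prefill = Pool("prefill", replicas=2)
        self.decode = Pool("decode", replicas=4)

    def handle(self, prompt: str, max_tokens: int) -> dict:
        # Stage 1: the prefill pool processes the whole prompt once,
        # producing a KV cache (represented here as a plain dict).
        kv_cache = {"tokens": prompt.split(), "pool": self.prefill.name}
        # Stage 2: the decode pool generates output tokens one at a
        # time, reading from that cache (placeholder tokens here).
        output = [f"tok{i}" for i in range(max_tokens)]
        return {"cache_from": kv_cache["pool"], "output": output}


server = DisaggregatedServer()
# Time-to-first-token lagging? Dial up only the prefill pool.
server.prefill.scale(6)
# Next-token delivery too slow? Dial up only the decode pool.
server.decode.scale(8)
result = server.handle("What is Kubernetes?", max_tokens=3)
```

In a monolithic setup, both stages would share one pool and one replica count, so scaling for prompt-heavy traffic would over-provision token generation, and vice versa; separating the pools is what makes the two dials possible.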
The broader point is that inference is starting to resemble the rest of enterprise infrastructure. It needs governance, abstraction and knobs that operators can actually use. The next phase will include multi-tenant model serving, request prioritization, support for newer accelerators and closer alignment with the security demands of agentic systems running on Kubernetes, according to Shaw.
“I couldn’t be more excited about … the next generation of open-source models,” Shaw said. “That’s just gonna drive more and more need for enterprises to really be running these models efficiently at scale for all sorts of applications across their companies.”
Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of the KubeCon + CloudNativeCon EU event:
(* Disclosure: Red Hat sponsored this segment of theCUBE. Neither Red Hat nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)