UPDATED 08:01 EDT / MARCH 24 2026

Brian Stevens, SVP and AI chief technology officer of Red Hat, and Robert Shaw, director of engineering at Red Hat, talk to theCUBE about Kubernetes inference at KubeCon + CloudNativeCon EU 2026.

Red Hat sees inference as AI’s next battleground — with Kubernetes at the core

As AI demands drive orders-of-magnitude increases in token consumption, the need for scalable, production-grade Kubernetes inference has never been greater.

The challenge now is less about training ever-larger models than about running them reliably, cheaply and at scale. In response, Red Hat Inc. has contributed llm-d, an open-source project for running large language models across Kubernetes clusters, to the Cloud Native Computing Foundation, a leading open-source group, as an early-stage community project. The contribution signals that distributed Kubernetes inference is moving from experiment to institution-building. The aim is to bring high-end inference into the operating model enterprise IT teams already use, according to Brian Stevens (pictured, right), senior vice president and AI chief technology officer of Red Hat.

“What we realized is that AI is being developed by data scientists, and as part of that, they’re building their own infrastructure to run it on,” Stevens said. “But the way we thought about it was eventually it’s going to be a CIO’s problem. And what language do CIOs speak these days? They speak KubeCon and Kubernetes and Kubernetes-based platforms. So, the challenge we had is how do we build best-in-class inference that’s scalable, manageable [and] delivers to the [service-level objective] the end users need — but bring it into a Kubernetes platform?”

Stevens and Robert Shaw (left), director of engineering at Red Hat, spoke with theCUBE’s Rebecca Knight and Rob Strechay at KubeCon + CloudNativeCon EU, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They described llm-d as an effort to make inference on Kubernetes faster, more portable and easier to manage across hardware environments. (* Disclosure below.)

Kubernetes inference becomes an enterprise systems concern

As LLMs move from labs into business systems, inference is becoming an operations problem. The llm-d project is designed to optimize clusters of vLLM (virtual large language model) instances, not just single nodes. That matters because enterprises want not only speed, but speed that survives contact with production: capacity planning, uptime, scaling and all the routine burdens of day-two operations, according to Shaw.

“The focus on performance, and really the reason why performance matters so much for LLM systems, is the L stands for large,” Shaw said. “These models are doing an amount of compute that’s hard to fathom, but when I talk to users of llm-d, they’re not only trying to build a state-of-the-art performance system, they’re also trying to do these day-two operations.”
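For a concrete flavor of what cluster-level inference means in practice, consider a minimal Python sketch of routing requests across a pool of model servers. vLLM exposes an OpenAI-compatible HTTP API; everything else here — the endpoint URLs, the model name and the prefix-hash routing rule — is an illustrative assumption, not llm-d's actual gateway logic:

    # Illustrative only: a toy prefix-affinity router over a pool of
    # OpenAI-compatible vLLM endpoints. Endpoint URLs and model name
    # are hypothetical; llm-d's real routing is more sophisticated.
    import hashlib
    import requests  # third-party: pip install requests

    REPLICAS = [
        "http://vllm-0.example.internal:8000",
        "http://vllm-1.example.internal:8000",
        "http://vllm-2.example.internal:8000",
    ]

    def pick_replica(prompt: str, prefix_len: int = 256) -> str:
        """Send requests that share a prompt prefix to the same replica,
        so its KV cache is more likely to already hold that prefix."""
        digest = hashlib.sha256(prompt[:prefix_len].encode()).digest()
        return REPLICAS[digest[0] % len(REPLICAS)]

    def complete(prompt: str) -> str:
        base = pick_replica(prompt)
        resp = requests.post(
            f"{base}/v1/completions",
            json={"model": "example-model", "prompt": prompt, "max_tokens": 64},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]

The point of the sketch is the shift in framing: the unit being managed is the pool, not any one server, which is exactly the day-two concern Shaw describes.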

One of the project’s central ideas is disaggregated serving. It separates the prefill and decode stages of inference into distinct, independently scalable pools. In practice, that gives IT teams more precise control over where latency appears and how resources are allocated, Stevens noted.

“Instead, what we did with disaggregation is we split those apart, and now all of a sudden we can independently scale the processing of the input from the production of the tokens,” he said. “There’s a dial on there that an IT [team] can say, ‘Oh, dial up the performance of the input processing, or … we’re not delivering fast enough on the next tokens, dial that up.'”
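The “two dials” Stevens describes can be made concrete with back-of-the-envelope capacity math: because prefill and decode run in separate pools, each pool is sized from its own load. The throughput figures below are invented for illustration, not llm-d defaults or benchmarks:

    # Illustrative capacity math for disaggregated serving: prefill and
    # decode replica counts are computed independently. All numbers are
    # made-up assumptions for the example.
    import math
    from dataclasses import dataclass

    @dataclass
    class PoolPlan:
        prefill_replicas: int
        decode_replicas: int

    def plan_pools(
        req_per_s: float,
        avg_prompt_tokens: float,
        avg_output_tokens: float,
        prefill_tok_per_s_per_replica: float,
        decode_tok_per_s_per_replica: float,
    ) -> PoolPlan:
        """Each pool gets its own dial: prefill capacity governs input
        processing (time to first token), decode capacity governs output
        token throughput (time between tokens)."""
        prefill_load = req_per_s * avg_prompt_tokens
        decode_load = req_per_s * avg_output_tokens
        return PoolPlan(
            prefill_replicas=math.ceil(prefill_load / prefill_tok_per_s_per_replica),
            decode_replicas=math.ceil(decode_load / decode_tok_per_s_per_replica),
        )

    # Example: 20 req/s, long prompts, short answers.
    print(plan_pools(20, 4000, 300, 50_000, 2_000))
    # PoolPlan(prefill_replicas=2, decode_replicas=3)

Turning either dial changes only its own pool: a workload with longer prompts grows the prefill pool, while one streaming longer answers grows the decode pool.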

The broader point is that inference is starting to resemble the rest of enterprise infrastructure. It needs governance, abstraction and knobs that operators can actually use. The next phase will include multi-tenant model serving, request prioritization, support for newer accelerators and closer alignment with the security demands of agentic systems running on Kubernetes, according to Shaw.
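Request prioritization, one of the roadmap items Shaw mentions, comes down to scheduling: deciding which queued request a serving pool handles next. A toy Python sketch, with an invented two-tier policy that favors interactive traffic over batch jobs, shows the shape of the idea:

    # Illustrative only: a toy priority scheduler for inference requests.
    # The two-tier policy and tenant names are invented for the example.
    import heapq
    import itertools

    _queue = []               # heap of (priority, seq, tenant, prompt)
    _seq = itertools.count()  # tie-breaker: FIFO within a priority tier

    def submit(tenant: str, prompt: str, interactive: bool) -> None:
        # Toy policy: interactive (user-facing) traffic outranks batch jobs.
        priority = 0 if interactive else 1
        heapq.heappush(_queue, (priority, next(_seq), tenant, prompt))

    def next_request() -> tuple:
        priority, _, tenant, prompt = heapq.heappop(_queue)
        return tenant, prompt

    submit("batch-analytics", "Summarize last week's logs.", interactive=False)
    submit("chat-frontend", "What's my order status?", interactive=True)
    print(next_request()[0])  # chat-frontend: interactive request is served first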

“I couldn’t be more excited about … the next generation of open-source models,” Shaw said. “That’s just gonna drive more and more need for enterprises to really be running these models efficiently at scale for all sorts of applications across their companies.”

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of the KubeCon + CloudNativeCon EU event:

(* Disclosure: Red Hat sponsored this segment of theCUBE. Neither Red Hat nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
