UPDATED 17:00 EST / NOVEMBER 12 2025

Google ramps up GKE inference for faster, cheaper Kubernetes AI

Google Kubernetes Engine is moving from hype to hardened practice as teams chase lower latency, higher throughput and portability. The GKE inference conversation has moved past questions of feasibility and on to codification: locking in standardized production patterns for model serving.

Google Cloud’s Akshay Ram, Kelsey Hightower and Eddie Villalba explore GKE inference with theCUBE.

In Kubernetes, that means capturing real, often unpredictable behaviors and turning them into consistent application programming interfaces that maximize accelerator utilization from day zero, according to Kelsey Hightower (pictured, center), distinguished engineer at Google Cloud. That messy practice hardens into a programmable contract for enterprises.

“Inference is still something people are trying to perfect,” he said. “We’ve seen people try to do it the old way and in better ways, but what Kubernetes allows you to do is take the practice and turn it into an API.”
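In Kubernetes terms, that contract can be as small as a Deployment manifest the cluster reconciles like any other workload. The sketch below is illustrative rather than anything the panel walked through; the serving image, model name and accelerator type are all assumptions:

```yaml
# Sketch only: the image tag, model and accelerator below are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # GKE's accelerator-aware scheduling label
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.3                # assumed serving image
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: "1"                       # reserve one accelerator per pod
```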

Hightower, Eddie Villalba (right), outbound product manager for artificial intelligence on GKE at Google Cloud, and Akshay Ram (left), group product manager at Google Cloud, spoke with theCUBE’s Savannah Peterson at the KubeCon + CloudNativeCon NA event, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed recent GKE inference announcements and day-zero performance and cost optimization. (* Disclosure below.)

How enterprises drive performance with GKE inference

Inference workloads have quickly become the proving ground for Kubernetes innovation. Teams must find ways to balance throughput, latency and portability while managing increasingly heterogeneous hardware. That evolution has sparked deep collaboration between open-source communities and hyperscalers to make inference frameworks more efficient, Ram explained.

“We’re working with a lot of communities to really think about inference in a way which is reasonably hardware-agnostic,” he said. “It takes some work, but it’s getting there.”
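That hardware-agnostic shape shows up in the manifests themselves: the pod template keeps the same structure, and only the scheduling hints and resource name change between accelerator families. The fragments below, which would slot into the Deployment template sketched above, are a hedged comparison using GKE's published node labels; the specific accelerator and topology values are assumptions:

```yaml
# Sketch: same pod-spec shape, different accelerator family (values assumed).
# GPU variant:
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-l4
containers:
- name: server
  resources:
    limits:
      nvidia.com/gpu: "1"
---
# TPU variant:
nodeSelector:
  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
  cloud.google.com/gke-tpu-topology: 1x1
containers:
- name: server
  resources:
    limits:
      google.com/tpu: "1"
```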

With these inference workloads, predictability is out and variability is in. Requests can range from short, frequently asked questions to long document summaries, stressing backends in very different ways. That pushes platform teams to standardize scheduling, load-balancing and accelerator awareness while keeping costs in check, according to Ram.

“Inference should just be the new microservice,” he said. “It’s just another workload, and people should really be talking about the value they get out of it in terms of the productivity gains, in terms of how it’s transforming their organization and how they’re doing it at low cost and low scale.”
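Cost control follows the same pattern of small, declarative knobs. One common lever on GKE, offered here as an illustration rather than a recommendation from the panel, is steering interruptible replicas onto Spot capacity with a few more lines in the same pod template:

```yaml
# Sketch: steer replicas onto GKE Spot node pools for lower cost (illustrative).
nodeSelector:
  cloud.google.com/gke-spot: "true"          # GKE's label for Spot nodes
tolerations:
- key: cloud.google.com/gke-spot             # matching taint GKE applies to Spot nodes
  operator: Equal
  value: "true"
  effect: NoSchedule
```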

That framing depends on fundamentals – treating inference as a Kubernetes-native serving pattern, not a one-off experiment – so teams can reuse autoscaling contracts, traffic management and observability while swapping in new model servers or accelerators. For practitioners, the message is pragmatic: master the core Kubernetes resource model and controller loop, then apply it to inference at your own pace and price point, Villalba shared.

“Inference is just another workload. [A] highly-specialized serving workload for sure, but it’s just another workload,” he said. “If I can get those principles of what it means to scale a web service to millions of pods and containers and so on, the next step is, okay, now I just need to know where those other things run from – but it’s the same kind of concept. Different scale, different economics, but it’s the same.”
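Those reusable contracts can be stock Kubernetes objects. As a hedged sketch, with the custom metric and service names assumed for illustration, an autoscaler can stay in place while the model server behind it is swapped, and a Gateway API route can canary a new server exactly the way it would a web service:

```yaml
# Sketch: an autoscaling contract that stays put while the model server changes.
# The custom metric name is an assumption for illustration.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting    # assumed queue-depth metric exported by the server
      target:
        type: AverageValue
        averageValue: "4"
---
# Sketch: weighted traffic split to canary a new model server (names assumed).
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway                # assumed Gateway
  rules:
  - backendRefs:
    - name: llm-server-stable              # current server's Service
      port: 8000
      weight: 90
    - name: llm-server-canary              # candidate server, 10% of traffic
      port: 8000
      weight: 10
```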

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of the KubeCon + CloudNativeCon NA event:

(* Disclosure: Google Cloud sponsored this segment of theCUBE. Neither Google Cloud nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
