A new performance wall is emerging as next-generation systems strain under skyrocketing model demands, especially as artificial intelligence workloads push inferencing engines beyond what conventional memory tiers can handle.
WekaIO Inc.’s newest release, built in collaboration with Nvidia Corp., takes direct aim at that constraint with a redesigned memory extension layer. By streaming key-value cache data between GPU memory and Weka’s token warehouse at near-memory speeds, it attempts to break the long-standing tradeoff between capacity and performance.
“We’re talking about KV cache acceleration,” said Shimon Ben-David (pictured, left), chief technology officer of WekaIO. “If we accelerate inferencing 4, 10, 20, 40 times faster, imagine the amount of tokens that you can generate and the amount of outcomes. AI is not a problem, you just get an outcome that is very powerful for our customers as a result.”
Ben-David spoke with theCUBE’s John Furrier and Dave Vellante at SC25, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. He was joined by Dion Harris (right), senior director of HPC, cloud and AI infrastructure solutions GTM at Nvidia. They discussed the work of both companies on a range of AI infrastructure and high-performance computing solutions. (* Disclosure below.)
Managing data properly in KV cache is an important process because large language models rely heavily on this stored information to understand and contextualize input prompts. Weka’s solution integrates closely with Nvidia Dynamo and Nvidia NIXL to move KV cache blocks between GPU memory and external storage without disrupting AI inferencing.
“Dynamo is really about delivering AI inferencing at scale, but across the entire tiered memory,” Harris explained. “Through Dynamo, we’ve exposed a new protocol, it’s called NIXL, which is Nvidia Inference Transfer Library. That allows us to expose this sort of hierarchy to our storage partners like Weka. They’re able to then immediately have this integration across the full orchestration.”
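The tiering idea Harris describes — keeping hot KV cache blocks in GPU memory and spilling colder blocks to external storage rather than recomputing them — can be illustrated with a toy sketch. This is a hedged, illustrative model only: the class and names below (`TieredKVCache`, `fast_capacity`) are hypothetical and are not part of the Dynamo, NIXL or Weka APIs.

```python
# Toy two-tier KV cache: an LRU "fast" tier stands in for GPU memory,
# and a dict stands in for an external token store. Evicted blocks are
# offloaded rather than discarded, so they can be reloaded instead of
# recomputed on a later request. Illustrative only; not a real API.
from collections import OrderedDict


class TieredKVCache:
    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # hot tier (stand-in for GPU memory)
        self.external = {}         # cold tier (stand-in for external storage)

    def put(self, block_id, kv_block):
        self.fast[block_id] = kv_block
        self.fast.move_to_end(block_id)
        # When the fast tier overflows, offload the least-recently-used
        # block to the external tier instead of dropping it.
        while len(self.fast) > self.fast_capacity:
            evicted_id, evicted = self.fast.popitem(last=False)
            self.external[evicted_id] = evicted

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        if block_id in self.external:
            # Reload from storage and promote back to the fast tier.
            kv = self.external.pop(block_id)
            self.put(block_id, kv)
            return kv
        return None  # true miss: the KV blocks would have to be recomputed


cache = TieredKVCache(fast_capacity=2)
cache.put("prompt-a", [0.1, 0.2])
cache.put("prompt-b", [0.3, 0.4])
cache.put("prompt-c", [0.5, 0.6])        # evicts "prompt-a" to external tier
print(cache.get("prompt-a") is not None)  # served from storage, not recomputed
```

The payoff is the last line: a block that would otherwise be a full recomputation becomes a transfer from storage, which is the tradeoff the Weka and Nvidia integration targets at much larger scale and speed.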
In addition to addressing the KV cache bottleneck, Nvidia has been working on high-performance computing solutions to power AI. At SC24 a year ago, the company announced the H200 NVL, a data center-grade GPU with greater memory efficiency and a 1.2 times bandwidth increase designed to boost retrieval-augmented generation, or RAG, pipelines, Harris noted.
“KV cache is key, but there’s also RAG,” he said. “Last year, when we were here…we described our integration to help facilitate RAG workflows where you’re taking proprietary data and you are using that to boost the intelligence, reduce hallucinations. That’s a core part of the data pipeline that is actually key for AI. Beyond AI, in fact, we’re at supercomputing.”
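The RAG workflow Harris refers to — retrieving proprietary data and using it to ground a model's answer — reduces to a retrieve-then-prompt loop. The sketch below is a minimal, hypothetical illustration: the documents are invented, and word-overlap scoring stands in for the embedding-based similarity search a production pipeline would use.

```python
# Minimal RAG-style sketch: rank documents against a query, then prepend
# the best match to the prompt so the model can ground its answer.
# Hypothetical data; word overlap stands in for vector similarity.
documents = {
    "gpu-specs": "The H200 NVL is a data center GPU with expanded memory",
    "hr-policy": "Employees accrue vacation days monthly",
}


def retrieve(query, docs, k=1):
    """Return the top-k passages by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]


def build_prompt(query, docs):
    """Assemble a grounded prompt: retrieved context first, question last."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt("What memory does the H200 NVL have?", documents)
print(prompt)  # the GPU document is selected as context, not the HR policy
```

Swapping the overlap scorer for an embedding index is the usual production step; the prompt-assembly shape stays the same.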
The enhancements announced by Nvidia and Weka at SC25 this week highlight a rapidly evolving environment where compute infrastructure is being developed to enable the promise of AI. Weka is already beginning to see a string of AI applications among customers, according to Ben-David.
“We’ve seen an adoption of a few very interesting use cases,” he said. “Everybody’s exploring AI, from chatbots to RAG environments, semantic searches and just raw inference accelerations. We’ve seen use cases such as video search and summarizations. We’re now seeing more and more physical intelligence where robotics environments are being generated.”
Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of SC25:
(* Disclosure: WekaIO sponsored this segment of theCUBE. Neither WekaIO nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)