UPDATED 15:20 EDT / NOVEMBER 19 2025

Augmented AI memory is reshaping AI economics as WEKA’s new NeuralMesh innovation boosts inference speed, efficiency and scalability.

Weka targets AI memory bottlenecks as inference pushes past DRAM limits

Artificial intelligence is running headfirst into a new performance wall, and the pressure is landing squarely on AI memory.

Models are ballooning, multi-turn interactions are piling up and agentic systems are now everyday infrastructure rather than research toys. That shift is exposing a widening gap between what modern inference needs and what traditional DRAM tiers can actually deliver. With persistence and scale becoming the real bottlenecks, WekaIO Inc.’s new augmented memory grid on NeuralMesh steps in as a direct response — aiming to reset expectations around cost, speed and capability.

“What we’ve done with the augmented memory grid is, we’ve taken the durable advantages of Weka’s product called NeuralMesh and we’ve plugged that into inference systems in a supported way,” said Callan Fox (pictured, right), principal product manager, AI inference and data management, at Weka. “What that allows us to do is take the memory tier of DRAM that exists today, as a common one and augment that, extend it into our system. It allows much larger capacities in the memory tier, but at the same speed as DRAM.”

Fox and Betsy Chernoff (left), principal AI product marketing manager at Weka, spoke with theCUBE’s John Furrier at SC25, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed the future of AI memory intelligence, where software will dynamically decide when workloads need speed or deeper reasoning.  (* Disclosure below.)

The next frontier of AI memory economics

AI is no longer about single-shot prompts such as “What’s the capital of Arizona?” It’s about coding copilots, reasoning agents and autonomous workflows that require huge context windows and fast, continuous decoding. Those workloads are colliding with the hard limits of GPU and DRAM memory, according to Chernoff.

“Where we see this showing impact for us is time to first token … reducing latency overall within that time to last token or total token throughput,” she said. “What we’ve seen with multi-turn and concurrent, large context benchmarks that we’ve done, we’ve been able to get to 6x lower, or 6x faster, time to first token. In addition to that, we’ve seen total token output turn into 4.2x, increasing it overall.”
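The metrics Chernoff cites relate in a straightforward way, which a short sketch can make concrete. The numbers below are hypothetical, chosen only to show how time to first token, time to last token and token throughput are derived from per-token timestamps; they are not WEKA benchmark data:

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive inference latency metrics from a request's token timestamps."""
    ttft = token_times[0] - request_start    # time to first token
    ttlt = token_times[-1] - request_start   # time to last token
    throughput = len(token_times) / ttlt     # total token throughput (tokens/s)
    return {"ttft_s": ttft, "ttlt_s": ttlt, "tokens_per_s": throughput}

# Hypothetical request starting at t=0.0 that emits four tokens:
m = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(m)  # ttft 0.5 s, ttlt 0.8 s, 5.0 tokens/s
```

A 6x-faster time to first token, in these terms, means the prefill phase (everything before `token_times[0]`) shrinks; a 4.2x total token output gain means `tokens_per_s` rises across the system.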

In real infrastructure terms, a single Nvidia H100 GPU costs about $30,000, putting a 100-GPU cluster at roughly $3 million. A 4.2x throughput boost means the same token output could in principle come from roughly 24 GPUs instead of 100, cutting about 76 GPUs and roughly $2.3 million in hardware cost, or the existing cluster could serve about 4.2x the traffic.
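The back-of-envelope arithmetic works out as follows. The $30,000-per-H100 price and the 4.2x multiplier come from the article; rounding up to whole GPUs is an obvious simplification:

```python
import math

def gpus_needed(baseline_gpus: int, speedup: float) -> int:
    """GPUs required to match baseline output at `speedup`x throughput."""
    return math.ceil(baseline_gpus / speedup)

GPU_COST = 30_000   # approximate cost of one Nvidia H100, per the article
BASELINE = 100      # baseline cluster size
SPEEDUP = 4.2       # total token throughput multiplier cited by Weka

needed = gpus_needed(BASELINE, SPEEDUP)  # 100 / 4.2 -> 24 GPUs
saved = BASELINE - needed                # 76 GPUs no longer required
print(f"GPUs needed: {needed}, saved: {saved}, "
      f"cost avoided: ${saved * GPU_COST:,}")
# -> GPUs needed: 24, saved: 76, cost avoided: $2,280,000
```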

“You can certainly think about this as fewer GPUs, but another way to think about this is more throughput,” Chernoff explained. “How much more can you get through this system? That’s really where augmented memory grid … comes into play.”

Weka’s work is far from over. As models grow longer context windows and agentic systems multiply, inference economics will hinge not just on flops, but on the ability to maximize throughput, reduce recomputation and keep GPUs fed efficiently. In a future where software will dynamically decide when workloads need speed or deeper reasoning, memory intelligence will be the backbone of that shift, Chernoff noted.

“As soon as that KV cache gets full, it’s got to go somewhere else, so it ends up hitting the system memory or the DRAM,” she said. “DRAM is also finite; it’s not very large. As soon as that goes away, that’s really where we come into play with augmented memory grid. We are that persistent layer of memory, if you will. While those two may be ephemeral or not be persistent, we provide that persistence in an incredibly meaningful way.”
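The tiering Chernoff describes, in which KV-cache blocks spill from GPU memory to DRAM and then to a persistent layer, can be sketched conceptually. This is an illustrative LRU-spill model, not WEKA’s actual implementation; the class name, capacities and eviction policy are assumptions for the example:

```python
from collections import OrderedDict

class TieredKVCache:
    """Conceptual three-tier KV cache: HBM -> DRAM -> persistent store."""

    def __init__(self, hbm_slots: int, dram_slots: int):
        self.tiers = [OrderedDict(), OrderedDict()]  # HBM and DRAM: both finite
        self.caps = [hbm_slots, dram_slots]
        self.persistent = {}  # persistent layer: effectively unbounded

    def put(self, key, kv_block):
        self._insert(0, key, kv_block)

    def _insert(self, tier, key, kv_block):
        if tier == len(self.tiers):       # spilled past DRAM: persist it
            self.persistent[key] = kv_block
            return
        cache = self.tiers[tier]
        cache[key] = kv_block
        cache.move_to_end(key)
        if len(cache) > self.caps[tier]:  # full: evict LRU block down a tier
            old_key, old_block = cache.popitem(last=False)
            self._insert(tier + 1, old_key, old_block)

    def get(self, key):
        for cache in self.tiers:          # check fast tiers first
            if key in cache:
                cache.move_to_end(key)
                return cache[key]
        return self.persistent.get(key)   # persistent hit: no recompute needed
```

The payoff of the persistent tier is the last line of `get`: a context whose KV blocks would otherwise have been evicted and recomputed from scratch can instead be fetched back, which is where the time-to-first-token gains in multi-turn workloads come from.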

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of SC25:

(* Disclosure: WekaIO sponsored this segment of theCUBE. Neither Weka nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
