Artificial intelligence inference is entering a new era defined not by compute alone, but by an escalating demand for context memory that traditional storage architectures were never designed to handle.
Inference didn’t hit a compute wall; it hit a context memory wall. As AI workloads evolve from single-shot prompts to multi-turn, agentic sessions with million-token context windows, the volume of key-value (KV) cache data is swelling into the petabytes, outpacing what GPU and DRAM memory tiers can absorb. Meanwhile, the global NAND shortage has moved from a supply-chain talking point to a material operational risk for organizations running AI workloads at scale. Together, these pressures are reshaping how storage companies approach AI factory design, according to Betsy Chernoff (pictured, left), principal AI and product marketing manager at WekaIO Inc.
“If you think about it from a level of where we started from even a year ago, people were just doing single shot prompts,” Chernoff said. “But as we’ve grown, you’ve seen things like multi-turn, concurrency, many users, many different rounds of conversations. Then, in addition to that, the context lengths themselves have grown. All of these have exponentially increased the amount of memory required for these systems.”
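For a rough sense of the scale Chernoff describes, here is a back-of-envelope estimate of KV cache footprint. The model dimensions are illustrative, loosely modeled on a 70-billion-parameter transformer with grouped-query attention, and are not figures cited by Weka or Solidigm:

```python
# Back-of-envelope KV cache sizing. All model parameters are
# illustrative assumptions, not vendor-supplied figures.

N_LAYERS = 80      # transformer layers
N_KV_HEADS = 8     # key/value heads under grouped-query attention
HEAD_DIM = 128     # dimension per attention head
BYTES = 2          # fp16/bf16 bytes per element

# Each token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES

context_len = 1_000_000   # a million-token context window
sessions = 10_000         # concurrent multi-turn sessions

per_session = kv_bytes_per_token * context_len
total = per_session * sessions

print(f"KV bytes per token:   {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Per 1M-token session: {per_session / 1e9:.0f} GB")
print(f"Across all sessions:  {total / 1e15:.1f} PB")
```

Under these assumptions, each token carries about 320 KiB of KV state, a single million-token session holds roughly 328 gigabytes, and 10,000 concurrent sessions exceed three petabytes, far beyond what GPU and DRAM tiers can hold.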
Chernoff and Ace Stryker (right), director of AI marketing and ecosystem at Solidigm, a trademark of SK hynix NAND Product Solutions Corp., spoke with theCUBE’s Gemma Allen at the Nvidia GTC AI Conference & Expo, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed how context memory is creating an entirely new storage tier in AI clusters and why the current NAND shortage makes efficiency more critical than ever. (* Disclosure below.)
At GTC 2026, Nvidia announced BlueField-4 STX, a modular reference architecture that inserts a dedicated context memory layer between GPUs and traditional storage. The first rack-scale implementation includes the new Nvidia CMX context memory storage platform, which expands GPU memory with a high-performance context layer for scalable inference and agentic systems. The announcement validates a direction both Weka and Solidigm have been building toward, according to Stryker.
“It feels like storage kind of got a promotion this year,” he said, describing context memory as a new, third responsibility layered on top of storage’s established roles. “That third job is new dedicated nodes specifically for storing context memory or KV cache. That’s a completely new tier of storage in an AI cluster. And, frankly, the market was already under siege and feeling intense demand before that announcement.”
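Conceptually, the new tier sits between GPU memory and conventional storage and is checked before any prefill recompute. The sketch below is purely schematic; the class, method and tier names are hypothetical and do not describe Weka’s, Solidigm’s or Nvidia’s actual interfaces:

```python
# Schematic of a tiered KV cache lookup: GPU memory first, then a
# flash-backed context tier, with a full prefill recompute only on a
# miss. Every name here is a hypothetical illustration, not a real
# product API.
import hashlib

def prefix_key(token_ids: list[int]) -> str:
    """Content-address a prompt prefix so identical prefixes hit."""
    return hashlib.sha256(str(token_ids).encode()).hexdigest()

class TieredKVCache:
    def __init__(self):
        self.gpu_tier = {}      # stand-in for hot KV blocks in HBM
        self.context_tier = {}  # stand-in for the persistent flash tier

    def fetch(self, token_ids, recompute):
        key = prefix_key(token_ids)
        if key in self.gpu_tier:          # fastest path: already in HBM
            return self.gpu_tier[key]
        if key in self.context_tier:      # hit: stream from flash and
            kv = self.context_tier[key]   # skip the prefill recompute
            self.gpu_tier[key] = kv       # promote to the hot tier
            return kv
        kv = recompute(token_ids)         # miss: pay the full prefill
        self.context_tier[key] = kv       # persist so no future session
        self.gpu_tier[key] = kv           # recomputes this prefix
        return kv
```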
Weka has been preparing for this shift since it unveiled Augmented Memory Grid at GTC 2025. At this year’s show, Chernoff pointed to a production-grade proof of concept with Firmus that delivered up to 6x improvement in tokens per second, underscoring the real-world impact of persistent KV cache storage.
“When we talk about numbers for token throughput, and we talk about things like customers never having to recompute another token unnecessarily, all of this impacts your ROI,” Chernoff said. “And that includes our partnership with Solidigm as well, because we can’t do this without you guys.”
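The arithmetic behind a result like that is straightforward: restoring a saved KV cache replaces the prefill recompute that otherwise dominates each conversational turn. The numbers below are illustrative assumptions, not the Firmus benchmark’s methodology:

```python
# Toy model of why persistent KV cache lifts effective throughput.
# All speeds and lengths are assumed for illustration only.

PREFILL_TPS = 10_000   # assumed prefill rate, tokens/s
DECODE_TPS = 100       # assumed decode rate, tokens/s per session
CONTEXT = 200_000      # tokens of prior conversation per turn
ANSWER = 500           # new tokens generated per turn

# Cold turn: re-prefill the entire context, then decode the answer.
cold = CONTEXT / PREFILL_TPS + ANSWER / DECODE_TPS
# Warm turn: KV cache restored from the context tier, decode only
# (assumes the restore is fast relative to recompute).
warm = ANSWER / DECODE_TPS

print(f"cold turn: {cold:.0f} s, warm turn: {warm:.0f} s")
print(f"speedup:   {cold / warm:.1f}x")
```

With these toy numbers, a warm turn completes about 5x faster than a cold one, in the same ballpark as the 6x figure cited, though the real gain depends on prefill speed, context length and how quickly the cache can be restored.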
Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of the Nvidia GTC AI Conference & Expo:
(* Disclosure: Solidigm sponsored this segment of theCUBE. Neither Solidigm nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)