UPDATED 14:42 EDT / MARCH 23 2026

Andy Pernsteiner, field CTO of Vast Data, talks with theCUBE about AI inference gains at the Nvidia GTC AI Conference & Expo 2026.

Vast Data and Nvidia target storage to unlock next-gen AI inference scale

As AI inference demand grows, storage is becoming the pressure point that determines how far GPUs can truly scale.

The pressure is especially acute for enterprises deploying agentic workflows, where massive fleets of agents send constant inference requests to GPU servers. Offloading previously computed attention data from high-bandwidth memory to intelligent storage tiers can significantly improve throughput without expanding GPU footprints, according to Andy Pernsteiner (pictured), field chief technology officer of data infrastructure company Vast Data Inc.

“We’re working with [Nvidia Corp.] on their Dynamo inference engine, specifically around tuning of [key-value] cache as it relates to offloading attention data from the GPU to compute off into a cache that can grow exponentially and allow for inference, not just for users who are using chatbots, but also for agents who are constantly throwing inference requests at GPU servers,” Pernsteiner told theCUBE. “We see a 10X improvement in inference capability out of a single GPU server.”

Pernsteiner spoke with theCUBE’s John Furrier at the Nvidia GTC AI Conference & Expo, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed how storage-layer AI inference optimization is reshaping GPU economics and enterprise AI deployment strategies. (* Disclosure below.)

AI inference optimization through intelligent storage tiering

Nvidia positions Dynamo as an open-source, low-latency, modular inference framework for serving generative AI models in distributed environments. Vast Data’s work with Nvidia on KV cache offloading — shifting stored model context out of GPU memory to free up compute capacity — fits directly into that architecture. By moving previously computed session data to storage, GPUs can handle more active sessions instead of repeatedly recalculating the same context, Pernsteiner explained.

“If a GPU isn’t busy having to recalculate previously computed session data, then it can easily service another request and asynchronously fetch that session data,” he said. “That’s the work that we’ve been working with Nvidia on.”
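The idea behind this kind of tiered KV cache can be illustrated with a toy model. The sketch below is purely illustrative — it is not Dynamo’s or Vast Data’s actual API, and all class and function names are hypothetical. It shows the core trade: when fast memory fills up, evicted session context is offloaded to a larger storage tier so a returning session is fetched rather than recomputed.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small fast tier (standing in for GPU HBM)
    evicts least-recently-used session context to a larger storage tier
    instead of discarding it. Hypothetical names, for illustration only."""

    def __init__(self, hbm_capacity):
        self.hbm_capacity = hbm_capacity
        self.hbm = OrderedDict()   # fast tier (simulated GPU memory)
        self.storage = {}          # large tier (simulated external storage)
        self.recomputes = 0        # contexts rebuilt from scratch
        self.storage_hits = 0      # contexts restored from storage instead

    def get(self, session_id, recompute_fn):
        if session_id in self.hbm:
            self.hbm.move_to_end(session_id)     # mark as recently used
            return self.hbm[session_id]
        if session_id in self.storage:
            self.storage_hits += 1               # fetch, don't recompute
            ctx = self.storage.pop(session_id)
        else:
            self.recomputes += 1                 # no cached copy anywhere
            ctx = recompute_fn(session_id)
        self._admit(session_id, ctx)
        return ctx

    def _admit(self, session_id, ctx):
        self.hbm[session_id] = ctx
        while len(self.hbm) > self.hbm_capacity:
            evicted_id, evicted_ctx = self.hbm.popitem(last=False)
            self.storage[evicted_id] = evicted_ctx   # offload, don't discard

# Many agents revisiting sessions on a small "GPU": without the storage
# tier, every revisit of an evicted session would be a full recompute.
cache = TieredKVCache(hbm_capacity=2)
for sid in ["a", "b", "c", "a", "b", "c"]:
    cache.get(sid, recompute_fn=lambda s: f"kv-context-{s}")

print(cache.recomputes)    # 3: each session's context computed once
print(cache.storage_hits)  # 3: every revisit served from the storage tier
```

In this toy run, every session is computed exactly once and each revisit is served from the offload tier — the same effect Pernsteiner describes at GPU scale, where the freed compute can service additional requests.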

Enterprise organizations face another obstacle in scaling AI inference into production: security. Many remain stuck in pilot phases because roll-your-own retrieval-augmented generation implementations often lack the policy enforcement needed to protect regulated data across business units, according to Pernsteiner. Vast Data has responded by integrating a policy model with Nvidia’s pipeline deployment framework to provide end-to-end security.

“We’ve talked to many customers this week where they’ve been in the pilot phase and what’s stopping them is this security and the ability to scale,” Pernsteiner said. “So, we’re giving them a path to do that.”
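The policy-enforcement gap Pernsteiner describes can be pictured with a minimal sketch: each retrieved document chunk carries an access label, and results are filtered against what the requesting business unit is permitted to see. All names and data here are hypothetical — this is the general pattern, not Vast Data’s or Nvidia’s implementation.

```python
# Toy policy gate over RAG retrieval results. In a real pipeline, chunks
# would first be ranked by vector similarity; this sketch shows only the
# per-business-unit access check that roll-your-own RAG stacks often omit.

CHUNKS = [
    {"text": "Q3 revenue forecast", "label": "finance"},
    {"text": "Public product FAQ",  "label": "public"},
    {"text": "Patient trial notes", "label": "clinical"},
]

POLICY = {  # business unit -> document labels it may read
    "finance-team": {"finance", "public"},
    "support-team": {"public"},
}

def retrieve(query, business_unit):
    """Return only the chunks the requesting unit is cleared to see."""
    allowed = POLICY.get(business_unit, set())
    return [c["text"] for c in CHUNKS if c["label"] in allowed]

print(retrieve("quarterly outlook", "support-team"))
# ['Public product FAQ'] -- regulated finance and clinical data is withheld
```

The point of enforcing policy at the retrieval layer, rather than in each application, is that every downstream agent or chatbot inherits the same guarantees — the kind of end-to-end enforcement that, per Pernsteiner, lets pilots move to production.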

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of Nvidia GTC AI Conference & Expo:

(* Disclosure: Vast Data sponsored this segment of theCUBE. Neither Vast nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE

A message from John Furrier, co-founder of SiliconANGLE:

Support our mission to keep content open and free by engaging with theCUBE community. Join theCUBE’s Alumni Trust Network, where technology leaders connect, share intelligence and create opportunities.

  • 15M+ viewers of theCUBE videos, powering conversations across AI, cloud, cybersecurity and more
  • 11.4k+ theCUBE alumni — Connect with more than 11,400 tech and business leaders shaping the future through a unique trust-based network.

About SiliconANGLE Media
SiliconANGLE Media is a recognized leader in digital media innovation, uniting breakthrough technology, strategic insights and real-time audience engagement. As the parent company of SiliconANGLE, theCUBE Network, theCUBE Research, CUBE365, theCUBE AI and theCUBE SuperStudios — with flagship locations in Silicon Valley and the New York Stock Exchange — SiliconANGLE Media operates at the intersection of media, technology and AI.

Founded by tech visionaries John Furrier and Dave Vellante, SiliconANGLE Media has built a dynamic ecosystem of industry-leading digital media brands that reach 15+ million elite tech professionals. Our new proprietary theCUBE AI Video Cloud is breaking ground in audience interaction, leveraging theCUBEai.com neural network to help technology companies make data-driven decisions and stay at the forefront of industry conversations.