

As enterprises push deeper into their AI journeys, the industry is undergoing a critical pivot from training massive models to running them efficiently at scale. This AI inference stage is less about building the models themselves and more about deploying them in production to serve billions of real-time decisions across the edge, data centers and the cloud.
Nvidia’s Harish Arora talks with theCUBE about how storage, not GPUs, is the backbone of AI inference.
While graphics processing units often take center stage, the true workhorse making AI inference possible is storage. From high-throughput pipelines to tiered architectures, storage infrastructure is what enables inference to deliver the speed, scale and reliability that modern enterprises demand, according to Harish Arora (pictured, front row, left), lead product manager at Nvidia Corp.
“Storage has a big role to play in this entire [retrieval-augmented generation] pipeline,” he said. “The vector database ingestion of all the embeddings and index creation is a very storage-centric operation. It requires high-performance storage, which is where high-performance file access and object access become important. On the point of AI factories, we are working with several high-performance object storage providers to expand the customer choice beyond just parallel file systems and NFS file systems to include high-performance objects as a great foundation for AI factories.”
In an exclusive panel discussion, Arora was joined by Tien Lee (back row, left), product manager at Super Micro Computer Inc.; Peter Sjoberg (front row, right), vice president of worldwide solution architects at Cloudian Inc.; Tony Asaro (front row, middle), chief strategy and business development officer at Hammerspace Inc.; and Ace Stryker (back row, right), director of market development at Solidigm, a trademark of SK Hynix NAND Products Solutions Corp. They spoke with theCUBE’s Rob Strechay at the Supermicro Open Storage Summit, during an exclusive broadcast on theCUBE, SiliconANGLE Media’s livestreaming studio. They discussed how AI’s future will depend on the ability of storage platforms to keep pace with the scale, speed and intelligence required by the next generation of inference workloads. (* Disclosure below.)
Enterprise unstructured data is expected to grow from 14 zettabytes today to 30 zettabytes by 2028, a surge that is driving the shift to AI inference at scale, according to Arora. To meet the challenge, Nvidia introduced NeMo Retriever services, built to accelerate retrieval-augmented generation pipelines. These services process trillions of unstructured files — images, charts, videos, logs — and transform them into embeddings stored in vector databases.
“These NeMo Retriever services are actually NIMs, and NIM stands for Nvidia Inference Microservice — easy to use in a cloud-native architecture,” Arora said. “These NIMs collectively extract all the data from the various enterprise data files and generate the embeddings and keep the embeddings or store the embeddings in a vector database.”
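Arora’s description follows the standard RAG ingestion pattern: extract content from files, chunk it, generate embeddings and index the vectors for similarity search. The sketch below illustrates that flow with generic open-source components (sentence-transformers and FAISS); it is a minimal illustration of the pattern under assumed model and chunking choices, not the NeMo Retriever or NIM APIs.

```python
# Minimal RAG-ingestion sketch: chunk documents, embed them, index for retrieval.
# Uses generic open-source pieces (sentence-transformers + FAISS), not Nvidia NIMs;
# the model name and chunk size are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

documents = [
    "Q3 revenue grew 12% on data center demand.",
    "The incident log shows elevated GPU memory errors on node 7.",
]

def chunk(text: str, size: int = 256) -> list[str]:
    # Naive fixed-size chunking; production pipelines use smarter splitters.
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = [c for doc in documents for c in chunk(doc)]

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])    # inner product = cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))

# Retrieval side: embed the query and pull the top-k nearest chunks.
query = model.encode(["Why did revenue grow?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
print([chunks[i] for i in ids[0]])
```

The ingestion half of this flow is the storage-heavy part Arora points to: every file must be read, embedded and written back into the vector index before it can serve a single query.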
For Hammerspace, the key next-gen AI challenge is eliminating data gravity — the friction that keeps distributed data tied to physical infrastructure. Hammerspace tackles this with its AI data platform, built on a high-performance parallel file system supporting both NFS 4.2 and S3 APIs, according to Asaro.
“We have this concept called ‘AI Anywhere,’” he said. “The idea is that your data is distributed. What Hammerspace has done is that we have eliminated data gravity. What we do importantly is we untether the data from the physical layer, we make sure we do that with a high degree of intelligence, control, security and specificity, meaning that we have file granular control over your data, where it’s orchestrated and where you want to move it so that you get the data where you need it and when you need it.”
While object storage was once relegated to long-term retention, Cloudian is emphasizing its new role in high-performance AI inference pipelines. By pioneering GPUDirect for object storage and enabling a direct parallel path between object stores and GPU memory, Cloudian’s architecture allows systems to achieve over 200 GB/s throughput while offloading CPU workloads, according to Sjoberg.
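GPUDirect for object storage itself moves data from the store into GPU memory through Nvidia’s I/O stack, which is not reproduced here. What the sketch below shows is only the host-side idea of a parallel path: issuing many concurrent ranged GETs against an S3-compatible endpoint so a single large object is fetched in parallel rather than in one serial stream. The endpoint, bucket and object names are placeholders.

```python
# Host-side illustration of parallel object reads: split one large object into
# byte ranges and fetch them concurrently. This is NOT GPUDirect (which DMAs data
# into GPU memory via Nvidia's stack); it only sketches the parallel-path idea.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3", endpoint_url="https://objectstore.example.local")  # placeholder endpoint

BUCKET, KEY = "training-data", "shard-0001.parquet"   # placeholder names
PART_SIZE = 64 * 1024 * 1024                          # 64 MiB per ranged GET

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def fetch_range(offset: int) -> bytes:
    end = min(offset + PART_SIZE, size) - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{end}")
    return resp["Body"].read()

offsets = range(0, size, PART_SIZE)
with ThreadPoolExecutor(max_workers=16) as pool:
    parts = list(pool.map(fetch_range, offsets))      # order is preserved

data = b"".join(parts)   # reassembled object, ready for the inference pipeline
```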
“We see object storage as a platform that serves so many of the different AI workloads that are putting these capabilities into the platform directly,” he said. “When we talk about inference at scale, our focus for today, we want to be able to see that performance scaling to meet those needs long-term as customers grow their inference capabilities.”
S3 API compatibility is central here. By ensuring seamless integration across environments, Cloudian enables developers to use the same SDKs and workflows across cloud and on-prem object platforms. The result is a data-centric AI platform that supports ingestion, preparation, inference and long-term preservation in one ecosystem.
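In practice, that portability amounts to little more than a different endpoint: the same boto3 calls run unchanged whether they point at a cloud S3 service or an on-prem S3-compatible store. The endpoint URL, credentials and bucket names below are placeholder assumptions, not Cloudian-specific values.

```python
# Same SDK, same calls, different endpoint: switching between cloud and on-prem
# S3-compatible storage is a configuration change, not a code change.
# Endpoint, credentials and bucket names are placeholder assumptions.
import boto3

def make_client(on_prem: bool):
    if on_prem:
        return boto3.client(
            "s3",
            endpoint_url="https://s3.objectstore.example.local",  # on-prem endpoint (placeholder)
            aws_access_key_id="LOCAL_KEY",
            aws_secret_access_key="LOCAL_SECRET",
        )
    return boto3.client("s3")  # default AWS endpoint and credential chain

s3 = make_client(on_prem=True)
s3.upload_file("embeddings.parquet", "rag-corpus", "embeddings/part-0001.parquet")
obj = s3.get_object(Bucket="rag-corpus", Key="embeddings/part-0001.parquet")
print(len(obj["Body"].read()), "bytes retrieved with the same S3 API")
```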
While object and file systems provide the backbone for inference pipelines, Solidigm’s architecture brings efficiency at the drive level. As a next-generation SSD provider, the company develops NAND technologies that push the physical limits of flash to deliver higher endurance, power efficiency and density, according to Stryker.
“There’s no AI without data; there’s no data without infrastructure and there’s no infrastructure without efficiency,” he said. “There are a lot of things that go into building an efficient data center. Folks may not realize that storage can account for a lot of that space and a lot of that power consumption if it’s not optimized. If you’re using legacy storage, there’s probably a lot of room for improvement, and that’s where products like these Solidigm ones on the slide come in.”
For Supermicro, the emphasis is on integration — ensuring that enterprises pursuing AI inference at scale have access to a complete solution rather than isolated components. The company draws on its engineering DNA, from motherboards to servers and now full-stack systems, to deliver end-to-end AI infrastructure designed for efficiency, flexibility and scale, according to Lee.
“At the end of the day, any organization that’s looking to capitalize on inferencing at scale, we need a complete solution,” he said. “You will not get success by just having only one piece that works. It’s always about a total integrated solution.”
Here’s a short clip from our interview, part of SiliconANGLE’s and theCUBE’s coverage of the Supermicro Open Storage Summit:
(* Disclosure: TheCUBE is a paid media partner for the Supermicro Open Storage Summit. Neither Super Micro Computer Inc., the sponsor of theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)