UPDATED 11:00 EDT / MARCH 11 2026

New memory architecture targets AI inference bottlenecks 

Lightbits Labs Ltd. today is introducing a new architecture aimed at addressing one of the most stubborn bottlenecks in large-scale artificial intelligence inference: the growing mismatch between the memory demands of large language models and the limited memory capacity of graphics processing units.

The company announced a collaborative design with ScaleFlux Inc. and FarmGPU Inc. that combines high-performance nonvolatile memory express storage, managed GPU inference infrastructure and Lightbits’ LightInferra software to make it easier for AI systems to persist and reuse the key-value cache data generated during inference. The approach is intended to reduce GPU stalls caused by repeatedly recomputing context and thus improve the efficiency of long-context AI workloads.
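
In concept, persisting the KV cache means keying previously computed attention state by the prompt prefix, so a later request that shares the prefix can load that state from storage instead of recomputing it. The following Python sketch illustrates the flow under simple assumptions; the function names and the file-backed store are hypothetical stand-ins, not Lightbits’ actual API.

```python
# Illustrative sketch of KV-cache persistence and reuse (hypothetical API,
# not LightInferra's). Cached attention state is keyed by a hash of the
# prompt prefix; a repeat prefix is loaded from storage instead of recomputed.
import hashlib
import os
import pickle

CACHE_DIR = "/mnt/nvme/kv-cache"  # assumed NVMe-backed mount point

def prefix_key(prompt_tokens: list[int]) -> str:
    """Content-address the cache entry by the token prefix."""
    return hashlib.sha256(str(prompt_tokens).encode("utf-8")).hexdigest()

def load_or_compute_kv(prompt_tokens, compute_kv_fn):
    """Return cached key/value tensors if present, else compute and persist them."""
    path = os.path.join(CACHE_DIR, prefix_key(prompt_tokens) + ".pkl")
    if os.path.exists(path):                      # cache hit: skip the prefill recompute
        with open(path, "rb") as f:
            return pickle.load(f)
    kv = compute_kv_fn(prompt_tokens)             # cache miss: run prefill once
    with open(path, "wb") as f:
        pickle.dump(kv, f)                        # persist for later requests
    return kv
```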

The announcement comes as cloud operators, particularly the AI-focused providers known as neoclouds, struggle with the economics of running inference workloads, where the cost of GPU infrastructure often dominates operating expenses.

“GPUs are pretty expensive resources, and they’re mandatory to run LLMs, which is the core of any inference solution,” said Abel Gordon, chief technology officer at Lightbits Labs. Improving how those expensive GPUs are utilized is the central design goal of the new platform.

Improving inference efficiency ultimately comes down to increasing the number of requests each GPU can serve, Gordon said.

“The ability to run more requests per GPU has a direct impact on the cost per token,” Gordon said. “By pairing our managed service with Lightbits’ high-performance storage running on ScaleFlux NVMe, we are able to lower time to first token and increase utilization on GPUs, drastically lowering the total cost of ownership for inference.”

Lightbits said its tests have shown up to a tripling of inference requests served on the same GPUs, along with a 65% reduction in power and infrastructure costs.
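
As a rough illustration of how throughput feeds into that economics: if the same GPU fleet serves three times as many requests for the same spend, the cost per token falls to roughly a third of the baseline. The snippet below shows only that arithmetic, treating requests and tokens as proportional; it is not an independent benchmark of the vendors’ figures.

```python
# Illustrative arithmetic only: cost per token falls as throughput per GPU rises.
def cost_per_token(cluster_cost_per_hour: float, tokens_per_hour: float) -> float:
    return cluster_cost_per_hour / tokens_per_hour

baseline = cost_per_token(cluster_cost_per_hour=100.0, tokens_per_hour=1_000_000)
tripled  = cost_per_token(cluster_cost_per_hour=100.0, tokens_per_hour=3_000_000)
print(baseline / tripled)  # 3.0 -- same spend, one third the cost per token
```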

KV-cache challenge

At the heart of the problem is the key-value or KV cache, which stores intermediate attention vectors generated during inference. These cached values allow models to reuse prior computation rather than recomputing results repeatedly.

“The KV cache keeps what’s called attention vectors, which basically remember previous computations,” Gordon said. “When you are processing inference requests, you can get data you already processed from the past instead of recomputing that data.”
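
In transformer decoding, each new token needs only its own key and value vectors appended to those already computed for earlier tokens; everything cached from prior steps is reused as-is. The toy NumPy sketch below shows that reuse with random weights and toy dimensions; it is a generic illustration of attention KV caching, not code from any of the companies involved.

```python
# Toy illustration of KV caching in attention decoding (NumPy, random weights).
import numpy as np

d = 64                                   # head dimension (toy value)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []                # the "KV cache": one entry per past token

def decode_step(x):
    """Attend over all cached tokens, appending only this step's K/V."""
    q = x @ Wq
    k_cache.append(x @ Wk)               # new key/value are computed once...
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(d)        # ...earlier entries are reused, not recomputed
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(5):                       # each step reuses the growing cache
    out = decode_step(rng.standard_normal(d))
```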

However, the size of that cache has been growing rapidly as models expand and context windows increase. Lightbits said the amount of memory required by the KV cache has been more than doubling every year.

The problem becomes particularly acute as organizations push toward longer context windows to support applications such as large knowledge bases, enterprise document search and persistent digital assistants, said Arthur Rasmusson, director of AI architecture at Lightbits Labs. “The speed requirements of the LLMs are far outstripping the amount of memory that fits on these chips,” he said.
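
The arithmetic behind that mismatch is straightforward: the cache holds a key tensor and a value tensor per layer, each sized context length times key-value heads times head dimension. The example below uses an illustrative 70B-class configuration chosen for this sketch, not figures supplied by Lightbits, and shows how a single long-context request can claim a large share of a GPU’s memory.

```python
# Illustrative KV-cache sizing for a hypothetical 70B-class model (FP16).
layers, kv_heads, head_dim = 80, 8, 128   # assumed grouped-query configuration
bytes_per_elem = 2                        # FP16
context_len = 128_000                     # one long-context request

kv_bytes = 2 * layers * context_len * kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB per request")   # ~39 GiB: roughly half of an 80 GB GPU
```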

Predictive data movement

LightInferra’s approach is to manage how data moves through multiple layers of memory, from network storage to system memory to GPU caches. The system predicts what information will be needed next and pre-positions it closer to the processor. It borrows concepts from CPU architectures that have been used for decades to prevent processors from stalling while waiting for data.
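
The analogue of a CPU prefetcher here is to predict which KV blocks an inference step will need next and to start copying them up the hierarchy before the compute that consumes them. The generic Python sketch below shows that overlap of transfer and compute; the placeholder functions and tier layout are assumptions for illustration, not LightInferra internals.

```python
# Generic prefetch-ahead loop across memory tiers (illustrative, not LightInferra).
from concurrent.futures import ThreadPoolExecutor

def fetch_to_gpu(block_id):
    """Placeholder for the storage -> host memory -> GPU copy of one KV block."""
    ...

def predict_next(block_id):
    """Placeholder for the access-pattern predictor (e.g., the next prompt segment)."""
    return block_id + 1

def run_inference(block_id):
    """Placeholder for the GPU attention work over one resident KV block."""
    ...

pool = ThreadPoolExecutor(max_workers=1)
block = 0
pending = pool.submit(fetch_to_gpu, block)          # warm the first block
for _ in range(8):
    pending.result()                                # block is now GPU-resident
    nxt = predict_next(block)
    pending = pool.submit(fetch_to_gpu, nxt)        # start the next copy early...
    run_inference(block)                            # ...so compute overlaps the transfer
    block = nxt
```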

With conventional architectures, “the GPU has to pause and copy to memory,” Rasmusson said. “This is where we see the opportunity. We want to keep those GPUs saturated.”

LightInferra models access patterns and latency across the memory stack to determine when and where to place data. The goal is to keep inference pipelines operating smoothly even when working sets exceed GPU memory capacity.
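
One way to frame the “when and where” decision is as a latency-hiding rule: a KV block can live in a slower tier only if the predicted time until its next use is long enough to cover copying it back to the GPU. The heuristic below is a generic sketch of that idea with assumed latency numbers; it is not Lightbits’ actual placement policy.

```python
# Generic tier-placement heuristic (assumed latencies, not Lightbits' policy).
TIER_LATENCY_US = {"gpu_hbm": 0, "host_dram": 50, "nvme": 500, "network": 2_000}

def choose_tier(predicted_reuse_us: float) -> str:
    """Place a KV block in the slowest tier whose refill latency the idle time still hides."""
    for tier in ("network", "nvme", "host_dram"):
        if predicted_reuse_us >= TIER_LATENCY_US[tier]:
            return tier            # refill finishes before the block is needed again
    return "gpu_hbm"               # reused too soon to evict without stalling

print(choose_tier(10))      # gpu_hbm
print(choose_tier(800))     # nvme
```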

“We adjust our data locality to make sure GPUs are not waiting on those data copies,” Rasmusson said. Improving the speed with which tokens are generated ultimately allows operators to increase throughput without adding more hardware.

Cloud and neocloud providers “can either reduce their GPU footprint, or deliver increased quantities of overall throughput in the cluster within their existing footprint,” he said.

The architecture is currently entering a design-partner stage, primarily with neoclouds, with production deployment scheduled for July.

Image: Nvidia
