Ahead of Nvidia Corp.’s GTC 2026 this week, we reiterate our thesis that the center of gravity in artificial intelligence is shifting from “How fast can you train?” to “How well can you serve?”
Training ushered in the modern AI era. Inference is where the monetization rubber meets the proverbial payback road. Token economics, latency requirements, power constraints, memory bottlenecks, NAND prices and, ultimately, end-customer willingness to pay will determine how fast and how much AI adopters can benefit. In his remarks on the last Nvidia earnings call, Chief Executive Jensen Huang hinted that Nvidia intends to push harder into low-latency inference with Groq’s decoder technology – and he’s telegraphing that we’ll see the specifics today at GTC. Low-latency inference is where the edge lights up, where agentic systems deliver value, and where infrastructure investments start to pay off.
Jensen essentially told investors two things on his last earnings call. First, he referenced Nvidia’s non-exclusive $20 billion licensing agreement with Groq for low-latency inference technology and said it will “extend Nvidia’s architecture with Groq’s innovations” the way it extended the architecture with Mellanox – and he explicitly said “we look forward to sharing more at GTC next month.”
Second, he reinforced the logic of his recent move: CUDA plus architectural compatibility lets Nvidia package software optimization into one stack and have it benefit Hopper, Blackwell and Ampere – extending useful life, improving performance per dollar and per watt, and giving customers an onramp to a new flywheel. Jensen hinted on the call that Groq becomes an “accelerator” inside that broader architecture – alluding to the Mellanox playbook, but aimed at the inference/decoder opportunity.
With that as the setup, we hosted a theCUBE + NYSE Wired panel – The Inference Engine: Building AI That Performs at Scale – to test what “inference at scale” really means when you look at the reality of the many constraints technologists face. The panel, hosted at theCUBE’s Palo Alto studio, underscored that the inference market is rapidly expanding, but it is not a single market. It’s a fragmented portfolio of workloads with different success metrics, different bottlenecks, and varying economics. Identifying clear horizontal monetization opportunities at scale remains elusive. We are still in the “Build it and they will come” phase of inference, in our view.
In our panel, Sid Sheth (d-Matrix) summarized sentiment saying inference “isn’t that much of a secret anymore,” especially “after the Nvidia-Groq deal” – the industry now acknowledges “the next big wave of AI computing is going to be around inference.”
We agree with his second point even more than the first: inference is not one-size-fits-all. It runs in big data centers, small data centers and edge environments – with big models and small models – and with “different metrics of success.” That’s the real market dynamic, and it makes granular sizing difficult. The “training winner-take-most” era was created by a default stack owned by Nvidia. The key question is whether that same dynamic carries through to inference. In other words, does the Nvidia/Groq deal validate alternatives, or will it blow them out of the market? The key determinants will be latency, context length, cost, throughput and power – and how those metrics present themselves differently by workload.
The assumption is that the market is so large and fragmented that, while a leader like Nvidia will do well and perhaps take most of it, there will be enough white space left for competitors.
Mitesh Agrawal (Positron) answered “yes and no” on whether every inference deployment is a “snowflake”: the workload definition changes with buyer priorities – time to first token, latency, time to last token, context length, memory and throughput.
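To make the “different metrics of success” point concrete, here is a minimal, illustrative sketch – our own construction, not anything the panelists presented – showing how the same hypothetical hardware profile can pass one workload’s latency targets and fail another’s:

```python
# Illustrative sketch: the same hardware can meet one workload's SLOs and miss another's.
# All numbers are hypothetical examples, not measurements from any vendor.

from dataclasses import dataclass

@dataclass
class InferenceProfile:
    ttft_ms: float         # time to first token
    tpot_ms: float         # time per output token (decode-step latency)
    context_tokens: int    # prompt/context length served
    tokens_per_sec: float  # sustained decode throughput per accelerator

def meets_slo(profile: InferenceProfile, max_ttft_ms: float, max_tpot_ms: float) -> bool:
    """A workload 'succeeds' only against its own latency targets."""
    return profile.ttft_ms <= max_ttft_ms and profile.tpot_ms <= max_tpot_ms

same_hardware = InferenceProfile(ttft_ms=450, tpot_ms=35, context_tokens=32_000, tokens_per_sec=900)

# Interactive agent: tight latency budget. Batch summarization: throughput is what matters.
print("agentic chat SLO:", meets_slo(same_hardware, max_ttft_ms=300, max_tpot_ms=20))          # False
print("batch summarization SLO:", meets_slo(same_hardware, max_ttft_ms=5_000, max_tpot_ms=100))  # True
```

The point of the toy example is that “best for inference” only means something relative to a specific workload’s service-level targets.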
He also made a point that sometimes gets lost in the market’s narrative: Nvidia GPUs have been the default for inference workloads because they’ve been the best “on a dollar basis,” but significant opportunities exist for alternatives that can deliver fast speeds and optimize expensive memory resources, especially as KV caches expand with code generation and video generation.
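To see why growing KV caches turn into a memory problem, here is a back-of-the-envelope sketch using the standard transformer KV-cache formula. The model dimensions are hypothetical round numbers for illustration, not any specific vendor’s part:

```python
# Back-of-the-envelope KV-cache sizing - a minimal sketch with hypothetical model dimensions.
# KV cache bytes ~= 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache footprint in GiB (bytes_per_elem=2 assumes FP16/BF16 values)."""
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch
    return total_bytes / (1024 ** 3)

# A hypothetical 70B-class dense model using grouped-query attention:
print(kv_cache_gib(layers=80, kv_heads=8, head_dim=128, seq_len=8_192))    # ~2.5 GiB per request
print(kv_cache_gib(layers=80, kv_heads=8, head_dim=128, seq_len=128_000))  # ~39 GiB per request
```

Stretching context from 8K to 128K tokens grows the per-request cache roughly 16x under these assumptions, which is exactly the pressure that pushes designers toward cheaper memory tiers or away from HBM exposure.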
This ties directly to why Jensen’s Groq hints we mentioned up front are so important. Specifically, inference at the edge was the one glaring gap in Nvidia’s massive portfolio. The Groq deal closes that gap. If Nvidia is about to put a low-latency decoder path inside the Nvidia stack, that’s an attempt to collapse one of the highest-value inference opportunities back into the CUDA ecosystem – the same way Mellanox collapsed networking advantage into the Nvidia platform. Jensen is essentially saying “we’re not passing on the low-latency opportunity and the best path is inside our control plane.”
While observers argue about model benchmarks, infrastructure builders are staring at a glaring energy deficit. Felix Ejeckam (Akash Systems) explained that there isn’t enough power in the grid to support the compute trajectory, and the stress increases as inference deployments ramp up.
Akash’s pitch is that lab-grown diamond applied directly to GPUs reduces the cooling load, dropping temperatures by ~10–15°C and pushing PUE closer to 1.0, without having to rebuild the facility. We haven’t validated the exact economics but believe the claims are directionally correct. The point is that inference economics rely on solving the power and cooling story as much as the silicon story.
We also note the investor commentary from Sam Awrabi (Banyan Ventures), who said the idea that “hardware costs all the money” misses that power can be a meaningful component of total cost. That’s a major reason inference is becoming the new battleground: as inference scales with usage, usage scales power and power scales the bill. Reducing power therefore enables more tokens to be generated at lower cost.
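To put rough numbers on that chain, here is a minimal sketch of energy cost per million output tokens. All figures – rack power, throughput, electricity price – are illustrative assumptions of ours, not panel data:

```python
# Energy cost per million output tokens - a minimal sketch with illustrative numbers.
# PUE (power usage effectiveness) = total facility power / IT power, so total draw = IT power * PUE.

def energy_cost_per_million_tokens(it_power_kw: float, tokens_per_sec: float,
                                   pue: float, usd_per_kwh: float) -> float:
    """Electricity cost (USD) to generate one million tokens at a steady rate."""
    kwh_per_million = it_power_kw * pue * (1_000_000 / tokens_per_sec) / 3600
    return kwh_per_million * usd_per_kwh

# Hypothetical rack serving 5,000 tokens/s on 40 kW of IT load at $0.08/kWh:
print(round(energy_cost_per_million_tokens(40, 5_000, pue=1.5, usd_per_kwh=0.08), 3))  # ~$0.267
print(round(energy_cost_per_million_tokens(40, 5_000, pue=1.1, usd_per_kwh=0.08), 3))  # ~$0.196
```

Under these assumed numbers, pushing PUE from 1.5 toward 1.1 trims the energy cost of the same tokens by roughly a quarter – which is the whole argument for attacking cooling rather than simply buying more silicon.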
The panel conversation turned to memory as pricing pressure becomes a gating factor. Sid Sheth emphasized d-Matrix intentionally avoided CoWoS and HBM, using stacked custom DRAM and LPDDR tiers to reduce exposure to the most constrained parts of the Nvidia-centric supply chain.
Mitesh added a broader point that memory pricing increases flow through the whole stack (HBM to DRAM to LP5X), and even beyond price, allocation is the real bottleneck – “good luck getting allocation for CoWoS and HBM ahead of Nvidia… then Broadcom ecosystem… then AMD… then Amazon… then Microsoft… then Meta.”
Our perspective is that fabrication capacity is a key constraint that often gets overlooked. Data center accelerators are absorbing TSMC’s fab capacity as suppliers such as Nvidia (for GPUs), Broadcom (for TPUs and the like) and others make much more aggressive volume growth commitments to TSMC than consumer chip designers do.
At a high level, there are two main constraints we’re monitoring – the front end and the back end of the semiconductor manufacturing process. Front-end capacity refers to upstream wafer fabrication – i.e. advanced logic process nodes – where the silicon and logic circuits are placed on the wafers. The back end (sometimes called the mid step) is where the CoWoS (Chip-on-Wafer-on-Substrate) process our two guests mentioned comes into play. CoWoS is a form of advanced packaging in which the fabricated chips are integrated with high-bandwidth memory, or HBM, substrates and the like to create the final accelerator packages.
Fabs such as TSMC have to balance front-end and back-end capacity. Last year the back end was a major constraint, and while it remains acute, the bottleneck is shifting to the front end of the process. The point is that AI demand is exploding, but silicon wafer production isn’t keeping up.
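A toy capacity model makes the balancing act easy to see. The numbers below are entirely made up for illustration; the only point is that shippable accelerators are gated by whichever step – good dies from the front end or advanced-packaging slots at the back end – yields fewer units:

```python
# Toy capacity model for the front-end vs. back-end constraint - illustrative numbers only.

def shippable_accelerators(wafer_starts_per_month: int, good_dies_per_wafer: float,
                           cowos_packages_per_month: int) -> int:
    """Finished accelerators are capped by the scarcer of good dies (front end)
    and advanced-packaging slots such as CoWoS (back end)."""
    front_end_units = int(wafer_starts_per_month * good_dies_per_wafer)
    back_end_units = cowos_packages_per_month
    return min(front_end_units, back_end_units)

# Scenario 1: packaging is the gate (the picture last year).
print(shippable_accelerators(30_000, 60, cowos_packages_per_month=1_200_000))  # 1,200,000
# Scenario 2: expanded CoWoS capacity shifts the constraint upstream to wafer starts.
print(shippable_accelerators(30_000, 60, cowos_packages_per_month=2_500_000))  # 1,800,000
```

In the first scenario packaging is the bottleneck; in the second, added CoWoS capacity moves the constraint upstream to wafer starts, which is the dynamic we describe above.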
The relevance for GTC is that Jensen’s architectural-compatibility argument is also a supply chain argument. When the same CUDA-optimized work benefits a large installed base for years, older installed bases keep producing revenue – and customers can more easily tolerate upgrades on Nvidia’s cadence because the stack remains current. That reduces churn, raises switching costs and creates lock-in, underscoring a subtle but powerful inference moat.
The panel gave us a good look at what the inference era will actually look like.
If Jensen’s Mellanox analogy comes to fruition, we expect Nvidia to present Groq as a platform extension rather than a bolt-on to its product line. It will likely be positioned as a capability that preserves CUDA’s “write once, run everywhere” advantage while improving latency-sensitive inference workloads. That is how Nvidia keeps its edge and inference story inside its own architecture – even though the Groq deal is technically non-exclusive.
We believe GTC 2026 will be remembered as the moment Nvidia brought a much stronger inference story into its platform. Jensen’s “we’ll share more at GTC” hints suggest the unveiling of a Groq roadmap that is likely to reset the narrative around inference. Putting a low-latency decoder path inside Nvidia’s stack will extend the useful life of the installed base, in our view. Organizations that align with Nvidia’s strategy will likely see the fastest performance-per-watt and performance-per-dollar improvements.
That said, the market for inference is so large that alternatives will find success where ultra-low latency needs, niche workloads and supply constraints create opportunities. Inference is where revenue growth meets physical constraints – and the winners will be the companies that translate the nuances of the inference market into predictable performance, lower operating cost, and deployable systems across data centers and edge environments.
What are your thoughts on the opportunity for AI inference at the edge? Where are the opportunities? What are the risks you see and how can they be mitigated?