Ahead of Nvidia Corp.’s GTC 2026 this week, we reiterate our thesis that the center of gravity in artificial intelligence is shifting from “How fast can you train?” to “How well can you serve?”
Training ushered in the modern AI era. Inference is where the monetization rubber meets the proverbial payback road. Token economics, latency requirements, power constraints, memory bottlenecks, NAND prices and, ultimately, end-customer willingness to pay will determine how fast and how much AI adopters can benefit. In his remarks on the last Nvidia earnings call, Chief Executive Jensen Huang hinted that Nvidia intends to push harder into low-latency inference with Groq’s decoder technology – and he’s telegraphing that we’ll see the specifics today at GTC. Low-latency inference is where the edge lights up, where agentic systems deliver value, and where infrastructure investments start to pay off.
Jensen essentially told investors two things on his last earnings call. First, he referenced Nvidia’s non-exclusive $20 billion licensing agreement with Groq for low-latency inference technology and said it will “extend Nvidia’s architecture with Groq’s innovations” the way it extended the architecture with Mellanox – and he explicitly said “we look forward to sharing more at GTC next month.”
Second, he reinforced the logic of his recent move: CUDA plus architectural compatibility lets Nvidia package software optimization into one stack and have it benefit Hopper, Blackwell and Ampere – extending useful life, improving performance per dollar and per watt, and giving customers an onramp to a new flywheel. Jensen hinted on the call that Groq becomes an “accelerator” inside that broader architecture – alluding to the Mellanox playbook, but aimed at the inference/decoder opportunity.
With that as the setup, we hosted a theCUBE + NYSE Wired panel – The Inference Engine: Building AI That Performs at Scale – to test what “inference at scale” really means when you look at the reality of the many constraints technologists face. The panel, hosted at theCUBE’s Palo Alto studio, underscored that the inference market is rapidly expanding, but it is not a single market. It’s a fragmented portfolio of workloads with different success metrics, different bottlenecks, and varying economics. Identifying clear horizontal monetization opportunities at scale remains elusive. We are still in the “Build it and they will come” phase of inference, in our view.
In our panel, Sid Sheth (d-Matrix) summarized sentiment saying inference “isn’t that much of a secret anymore,” especially “after the Nvidia-Groq deal” – the industry now acknowledges “the next big wave of AI computing is going to be around inference.”
We agree with his second point even more than the first: inference is not one-size-fits-all. It runs in big data centers, small data centers and edge environments – with big models and small models – and with “different metrics of success.” That’s the real market dynamic, and it makes granular sizing difficult. The “training winner-take-most” era was created by a default stack owned by Nvidia. The key question is whether that same dynamic carries through to inference. In other words, does the Nvidia/Groq deal validate alternatives, or will it blow them out of the market? The key determinants will be latency, context length, cost, throughput and power – and how those metrics present themselves differently by workload.
The assumption is that the market is so large and fragmented that, while a leader like Nvidia will do well and perhaps take most of it, there will be enough white space left for competitors.
Mitesh Agrawal (Positron) answered “yes and no” on whether every inference deployment is a “snowflake”: the workload definition changes with buyer priorities – time to first token, latency, time to last token, context length, memory and throughput.
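To make the “different metrics of success” point concrete, here is a minimal, illustrative sketch – our own construction, not anything the panelists presented – showing how the same hypothetical hardware profile can pass one workload’s latency targets and fail another’s:

```python
# Illustrative sketch: the same hardware can meet one workload's SLOs and miss another's.
# All numbers are hypothetical examples, not measurements from any vendor.

from dataclasses import dataclass

@dataclass
class InferenceProfile:
    ttft_ms: float         # time to first token
    tpot_ms: float         # time per output token (decode-step latency)
    context_tokens: int    # prompt/context length served
    tokens_per_sec: float  # sustained decode throughput per accelerator

def meets_slo(profile: InferenceProfile, max_ttft_ms: float, max_tpot_ms: float) -> bool:
    """A workload 'succeeds' only against its own latency targets."""
    return profile.ttft_ms <= max_ttft_ms and profile.tpot_ms <= max_tpot_ms

same_hardware = InferenceProfile(ttft_ms=450, tpot_ms=35, context_tokens=32_000, tokens_per_sec=900)

# Interactive agent: tight latency budget. Batch summarization: throughput is what matters.
print("agentic chat SLO:", meets_slo(same_hardware, max_ttft_ms=300, max_tpot_ms=20))          # False
print("batch summarization SLO:", meets_slo(same_hardware, max_ttft_ms=5_000, max_tpot_ms=100))  # True
```

The point of the toy example is that “best for inference” only means something relative to a specific workload’s service-level targets.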
He also made a point that sometimes gets lost in the market’s narrative: Nvidia GPUs have been the default for inference workloads because they’ve been the best “on a dollar basis,” but significant opportunities exist for alternatives that can deliver fast speeds and optimize expensive memory resources, especially as KV caches expand with code generation and video generation.
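To see why growing KV caches turn into a memory problem, here is a back-of-the-envelope sketch using the standard transformer KV-cache formula. The model dimensions are hypothetical round numbers for illustration, not any specific vendor’s part:

```python
# Back-of-the-envelope KV-cache sizing - a minimal sketch with hypothetical model dimensions.
# KV cache bytes ~= 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache footprint in GiB (bytes_per_elem=2 assumes FP16/BF16 values)."""
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch
    return total_bytes / (1024 ** 3)

# A hypothetical 70B-class dense model using grouped-query attention:
print(kv_cache_gib(layers=80, kv_heads=8, head_dim=128, seq_len=8_192))    # ~2.5 GiB per request
print(kv_cache_gib(layers=80, kv_heads=8, head_dim=128, seq_len=128_000))  # ~39 GiB per request
```

Stretching context from 8K to 128K tokens grows the per-request cache roughly 16x under these assumptions, which is exactly the pressure that pushes designers toward cheaper memory tiers or away from HBM exposure.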
This ties directly to why Jensen’s Groq hints we mentioned up front are so important. Specifically, inference at the edge was the one glaring gap in Nvidia’s massive portfolio. The Groq deal closes that gap. If Nvidia is about to put a low-latency decoder path inside the Nvidia stack, that’s an attempt to collapse one of the highest-value inference opportunities back into the CUDA ecosystem – the same way Mellanox collapsed networking advantage into the Nvidia platform. Jensen is essentially saying “we’re not passing on the low-latency opportunity and the best path is inside our control plane.”
While observers argue about model benchmarks, infrastructure builders are staring at a glaring energy deficit. Felix Ejeckam (Akash Systems) explained that there isn’t enough power in the grid to support the compute trajectory, and the stress increases as inference deployments ramp up.
Akash’s pitch is that lab-grown diamond applied directly to GPUs reduces the cooling load, dropping temperatures by ~10–15°C and pushing PUE closer to 1.0, without having to rebuild the facility. We haven’t validated the exact economics but believe the claims are directionally correct. The point is that inference economics rely on solving the power and cooling story as much as the silicon story.
We also note the investor commentary from Sam Awrabi (Banyan Ventures), who said the idea that “hardware costs all the money” misses that power can be a meaningful component of total cost. That’s a major reason inference is becoming the new battleground: as inference scales with usage, usage scales power and power scales the bill. Reducing power therefore enables more tokens to be generated at lower cost.
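To put rough numbers on that chain, here is a minimal sketch of energy cost per million output tokens. All figures – rack power, throughput, electricity price – are illustrative assumptions of ours, not panel data:

```python
# Energy cost per million output tokens - a minimal sketch with illustrative numbers.
# PUE (power usage effectiveness) = total facility power / IT power, so total draw = IT power * PUE.

def energy_cost_per_million_tokens(it_power_kw: float, tokens_per_sec: float,
                                   pue: float, usd_per_kwh: float) -> float:
    """Electricity cost (USD) to generate one million tokens at a steady rate."""
    kwh_per_million = it_power_kw * pue * (1_000_000 / tokens_per_sec) / 3600
    return kwh_per_million * usd_per_kwh

# Hypothetical rack serving 5,000 tokens/s on 40 kW of IT load at $0.08/kWh:
print(round(energy_cost_per_million_tokens(40, 5_000, pue=1.5, usd_per_kwh=0.08), 3))  # ~$0.267
print(round(energy_cost_per_million_tokens(40, 5_000, pue=1.1, usd_per_kwh=0.08), 3))  # ~$0.196
```

Under these assumed numbers, pushing PUE from 1.5 toward 1.1 trims the energy cost of the same tokens by roughly a quarter – which is the whole argument for attacking cooling rather than simply buying more silicon.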
The panel conversation turned to memory as pricing pressure becomes a gating factor. Sid Sheth emphasized d-Matrix intentionally avoided CoWoS and HBM, using stacked custom DRAM and LPDDR tiers to reduce exposure to the most constrained parts of the Nvidia-centric supply chain.
Mitesh added a broader point that memory pricing increases flow through the whole stack (HBM to DRAM to LP5X), and even beyond price, allocation is the real bottleneck – “good luck getting allocation for CoWoS and HBM ahead of Nvidia… then Broadcom ecosystem… then AMD… then Amazon… then Microsoft… then Meta.”
Our perspective is that fabrication capacity is a key constraint that often gets overlooked. Data center accelerators are absorbing TSMC’s fab capacity as suppliers such as Nvidia (for GPUs), Broadcom (for TPUs and the like) and others make much more aggressive volume growth commitments to TSMC than consumer chip designers do.
At a high level, there are two main constraints we’re monitoring – the front end and the back end of the semiconductor manufacturing process. Front-end capacity refers to upstream wafer fabrication – i.e. advanced logic process nodes – where the silicon and logic circuits are placed on the wafers. The back end (sometimes called the mid step) is where the CoWoS (Chip-on-Wafer-on-Substrate) process our two guests mentioned comes into play. CoWoS is a form of advanced packaging in which the fabricated chips are integrated with high-bandwidth memory, or HBM, substrates and the like to create the final accelerator packages.
Fabs such as TSMC have to balance front-end and back-end capacity. Last year the back end was a major constraint, and while it remains acute, the bottleneck is shifting to the front end of the process. The point is that AI demand is exploding, but silicon wafer production isn’t keeping up.
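A toy capacity model makes the balancing act easy to see. The numbers below are entirely made up for illustration; the only point is that shippable accelerators are gated by whichever step – good dies from the front end or advanced-packaging slots at the back end – yields fewer units:

```python
# Toy capacity model for the front-end vs. back-end constraint - illustrative numbers only.

def shippable_accelerators(wafer_starts_per_month: int, good_dies_per_wafer: float,
                           cowos_packages_per_month: int) -> int:
    """Finished accelerators are capped by the scarcer of good dies (front end)
    and advanced-packaging slots such as CoWoS (back end)."""
    front_end_units = int(wafer_starts_per_month * good_dies_per_wafer)
    back_end_units = cowos_packages_per_month
    return min(front_end_units, back_end_units)

# Scenario 1: packaging is the gate (the picture last year).
print(shippable_accelerators(30_000, 60, cowos_packages_per_month=1_200_000))  # 1,200,000
# Scenario 2: expanded CoWoS capacity shifts the constraint upstream to wafer starts.
print(shippable_accelerators(30_000, 60, cowos_packages_per_month=2_500_000))  # 1,800,000
```

In the first scenario packaging is the bottleneck; in the second, added CoWoS capacity moves the constraint upstream to wafer starts, which is the dynamic we describe above.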
The relevance for GTC is that Jensen’s architectural-compatibility argument is also a supply chain argument. When the same CUDA-optimized work benefits a large installed base for years, older installed bases keep producing revenue – and customers can more easily tolerate upgrades on Nvidia’s cadence because the stack remains current. That reduces churn, raises switching costs and creates lock-in, underscoring a subtle but powerful inference moat.
The panel gave us a good look at what the inference era will actually look like.
If Jensen’s Mellanox analogy comes to fruition, we expect Nvidia to present Groq as a platform extension rather than a bolt-on to its product line. It will likely be positioned as a capability that preserves CUDA’s “write once, run everywhere” advantage while improving latency-sensitive inference workloads. That is how Nvidia keeps its edge and inference story inside its own architecture – even though the Groq deal is technically non-exclusive.
We believe GTC 2026 will be remembered as the moment Nvidia brought a much stronger inference story into its platform. Jensen’s “we’ll share more at GTC” hints suggest the unveiling of a Groq roadmap that is likely to reset the narrative around inference. Putting a low-latency decoder path inside Nvidia’s stack will extend the useful life of the installed base, in our view. Organizations that align with Nvidia’s strategy will likely see the fastest performance-per-watt and performance-per-dollar improvements.
That said, the market for inference is so large that alternatives will find success where ultra-low latency needs, niche workloads and supply constraints create opportunities. Inference is where revenue growth meets physical constraints – and the winners will be the companies that translate the nuances of the inference market into predictable performance, lower operating cost, and deployable systems across data centers and edge environments.
What are your thoughts on the opportunity for AI inference at the edge? Where are the opportunities? What are the risks you see and how can they be mitigated?