UPDATED 13:00 EDT / SEPTEMBER 08 2023

AI

Nvidia debuts new software to boost AI model performance on its high-end chips

Nvidia Corp. today announced a new open-source software suite called TensorRT-LLM that expands the capabilities of large language model optimizations on Nvidia graphics processing units and pushes the limits of artificial intelligence inference performance after deployment.

Generative AI large language models have become popular thanks to their impressive capabilities, pushing the envelope of what’s possible with AI. They are being put to use across numerous industries to let users “talk to their data” with chatbots, summarize large documents, write software code and discover new ways to understand information.

“LLM inference is getting harder,” said Ian Buck, vice president of hyperscale and high-performance computing at Nvidia. “The models are increasing in complexity and as they get smarter, they get bigger, which is natural, but as they expand beyond the scope of a single GPU and have to run across multiple GPUs, that becomes a problem.”

In AI, inference is the process by which a model deals with new data it has never seen before, such as when it’s tasked with summarization, code production, providing advice or answering questions. It’s the workhorse of a large language model.
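
At its core, LLM inference is an autoregressive loop: the model predicts one token at a time, conditioned on everything generated so far, so cost grows with both model size and output length. The sketch below illustrates that loop in plain Python using the Hugging Face Transformers library; the model choice and settings are assumptions for demonstration only and have nothing to do with Nvidia's software.

# Illustrative only: a minimal autoregressive inference loop. GPT-2 is used
# here purely as a small stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Large language model inference is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# One forward pass per generated token: this per-token loop is why inference
# latency and cost scale with model size and the length of the response.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decoding
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))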

As the model ecosystem expands rapidly, models are growing larger and gaining more capabilities. Many are now so large that they no longer fit on a single GPU and must be split apart: developers and engineers have to manually fragment their workloads and coordinate execution across devices in order to get responses in real time. TensorRT-LLM helps solve this with “tensor parallelism,” which splits a model’s weights across multiple GPUs so inference can run efficiently at large scale.
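
As a rough illustration of the idea behind tensor parallelism (not Nvidia’s implementation), the sketch below splits one layer’s weight matrix column-wise across two shards that would live on separate GPUs, computes the partial matrix multiplications independently, and concatenates the results. The sizes and tensors are invented for demonstration.

# Conceptual sketch of tensor parallelism: a single linear layer sharded
# column-wise across two devices. Runs on CPU here; in practice each shard
# would sit on its own GPU and results would be gathered with a collective op.
import torch

hidden, out_features = 8, 16
x = torch.randn(1, hidden)              # activation entering the layer
w = torch.randn(hidden, out_features)   # full weight matrix (conceptually too big for one GPU)

w_shard_0, w_shard_1 = w.chunk(2, dim=1)  # each shard holds half the output columns

partial_0 = x @ w_shard_0               # would run on GPU 0
partial_1 = x @ w_shard_1               # would run on GPU 1
y = torch.cat([partial_0, partial_1], dim=1)

# The sharded computation reproduces the single-device result.
assert torch.allclose(y, x @ w, atol=1e-5)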

Additionally, since there is a wide variety of LLMs on the market, Nvidia has optimized its kernels for the major models in operation today. The software suite includes fully optimized, ready-to-run versions of popular LLMs, including Meta Platforms Inc.’s Llama 2, OpenAI LP’s GPT-2 and GPT-3, Falcon, MosaicML’s MPT and BLOOM.

In-flight batching to handle dynamic workloads

Because of the nature of LLMs, their workloads can be highly dynamic, and their resource needs and usage patterns can change over time. A single model could simultaneously serve as a question-and-answer chatbot and summarize both long and short documents. As a result, output sizes can vary by orders of magnitude.

To handle these varied workloads, TensorRT-LLM introduces a mechanism called “in-flight batching,” which optimizes scheduling by breaking text generation into smaller fragments that can be shifted in and out of the GPU as they complete. That way, an entire batch doesn’t need to finish before a new one is started.

Previously, if a large request arrived, such as a summarization request for an extremely long document, everything queued behind it had to wait for that process to finish before the queue could move forward.
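
Because generation proceeds token by token, a server can evict a sequence from the batch the moment it finishes and admit a waiting request in its place, instead of waiting on the slowest member of the batch. The following Python sketch of such a scheduler loop is purely illustrative; the request class, token-generation stand-in and batch-size limit are assumptions, not TensorRT-LLM internals.

# Purely illustrative sketch of in-flight (continuous) batching. Request,
# generate_one_token() and MAX_BATCH are hypothetical stand-ins.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    name: str
    tokens_remaining: int          # e.g. a short chat reply vs. a long summary
    output: list = field(default_factory=list)

def generate_one_token(req: Request) -> None:
    req.output.append("tok")       # stand-in for one decoding step on the GPU
    req.tokens_remaining -= 1

MAX_BATCH = 2
waiting = deque([Request("chat", 3), Request("summary", 10), Request("code", 5)])
active: list[Request] = []

while waiting or active:
    # Admit new requests as soon as slots free up -- the whole batch never
    # has to drain first, which is the core of in-flight batching.
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())

    for req in active:
        generate_one_token(req)    # one decoding step for every active request

    # Evict finished sequences immediately so queued work can start.
    for req in [r for r in active if r.tokens_remaining == 0]:
        print(f"{req.name} finished after {len(req.output)} tokens")
        active.remove(req)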

Nvidia has been working with numerous companies to optimize TensorRT-LLM, including Meta, Cohere Inc., Grammarly Inc., Databricks Inc. and Tabnine Ltd. With their assistance, Nvidia has been streamlining the capabilities and tools in the software suite, including an open-source Python application programming interface for defining and optimizing new architectures to customize LLMs.

For example, MosaicML added extra features on top of TensorRT-LLM when it integrated it with its existing software stack. Naveen Rao, vice president of engineering at Databricks, said it was a simple process.

“TensorRT-LLM is easy to use, feature-packed with streaming of tokens, in-flight batching, paged-attention, quantization and more, and is efficient,” Rao said. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”

Nvidia claims that TensorRT-LLM’s optimizations, including in-flight batching, can more than double inference performance for article summarization on the Nvidia H100. In tests using the GPT-J-6B model on CNN/Daily Mail article summarization, the H100 alone was four times faster than the A100, and eight times faster with TensorRT-LLM’s optimizations enabled.

TensorRT-LLM gives developers and engineers a deep learning compiler, optimized LLM kernels, pre- and post-processing steps and multi-GPU/multi-node communication capabilities behind a simple open-source API, so they can quickly optimize and run LLMs for inference in production. As LLMs continue to reshape data centers, the higher performance demanded by enterprise use means developers need tools like these more than ever.

The TensorRT-LLM software suite is now available in early access to developers in the Nvidia developer program and will be integrated into the NeMo framework next month, which is part of Nvidia AI Enterprise, the company’s end-to-end software platform for production AI.

Image: Nvidia
