Nvidia debuts new software to boost AI model performance on its high-end chips
Nvidia Corp. today announced a new open-source software suite called TensorRT-LLM that expands the capabilities of large language model optimization on Nvidia graphics processing units and pushes the limits of artificial intelligence inference performance after deployment.
Generative AI large language models have become popular thanks to their impressive capabilities, and they have expanded the envelope of what’s possible with AI. They are being put to use across numerous industries to let users “talk to their data” with chatbots, summarize large documents, write software code and discover new ways to understand information.
“LLM inference is getting harder,” said Ian Buck, vice president of hyperscale and high-performance computing at Nvidia. “The models are increasing in complexity and as they get smarter, they get bigger, which is natural, but as they expand beyond the scope of a single GPU and have to run across multiple GPUs, that becomes a problem.”
In AI, inference is the process by which a model deals with new data it has never seen before, such as when it’s tasked with summarization, code production, providing advice or answering questions. It’s the workhorse of a large language model.
As the model ecosystem expands rapidly, models are getting bigger and gaining more capabilities. That also means many are now too large to fit on a single GPU and must be split apart. Developers and engineers have had to manually fragment their workloads and coordinate execution across GPUs to get responses in real time. TensorRT-LLM helps solve this with “tensor parallelism,” which splits a model’s weight matrices across devices so inference can run efficiently at large scale across multiple GPUs.
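To make the idea concrete, here is a minimal NumPy sketch of tensor parallelism rather than TensorRT-LLM’s actual API: a linear layer’s weight matrix is split column-wise across two simulated devices, each device computes its shard of the output, and the partial results are gathered back together. All shapes and names are illustrative.

```python
# Illustrative sketch of tensor parallelism (not TensorRT-LLM's API):
# a linear layer's weight matrix is split column-wise across "devices",
# each device computes its shard of the output, and the shards are
# concatenated -- the same idea applied across real GPUs at scale.
import numpy as np

rng = np.random.default_rng(0)

hidden, out_features, num_devices = 8, 16, 2     # hypothetical sizes
x = rng.standard_normal((1, hidden))             # one token's activation
W = rng.standard_normal((hidden, out_features))  # full weight matrix

# Column-parallel split: each "device" holds out_features / num_devices columns.
shards = np.split(W, num_devices, axis=1)

# Each device multiplies the same input by its own shard (in parallel on real hardware).
partial_outputs = [x @ shard for shard in shards]

# An all-gather step concatenates the partial results into the full output.
y_parallel = np.concatenate(partial_outputs, axis=1)

# Sanity check: matches the single-device computation.
assert np.allclose(y_parallel, x @ W)
print("column-parallel output shape:", y_parallel.shape)
```

On real hardware, the per-shard multiplications run on separate GPUs and the final concatenation becomes a communication step between them, which is the coordination work the library handles for the developer.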
Additionally, since there is a wide variety of LLMs on the market today, Nvidia has optimized its kernels for the major ones in operation. The software suite includes fully optimized, ready-to-run versions of popular LLMs, including Meta Platforms Inc.’s Llama 2, OpenAI LP’s GPT-2 and GPT-3, Falcon, MosaicML’s MPT and BLOOM.
In-flight batching to handle dynamic workloads
Because of the nature of LLMs, their workloads can be highly dynamic, and their resource needs and usage patterns can change over time. A single model could be used simultaneously as a question-and-answer chatbot and to summarize both long and short documents. As a result, output sizes can vary by orders of magnitude.
To handle these varied workloads, TensorRT-LLM introduces a mechanism known as “in-flight batching,” an optimized scheduling process that breaks text generation down into multiple fragments that can be shifted in and out of the GPU. That way, an entire batch doesn’t need to finish before a new one is started.
Previously, if a large request arrived, such as a summarization request for an extremely long document, everything behind it had to wait for that process to finish before the queue could move forward.
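The scheduling idea can be illustrated with a toy Python loop; the slot count, request names and token counts below are hypothetical, and the real scheduler lives inside TensorRT-LLM. Instead of waiting for every sequence in a batch to finish, the loop evicts completed sequences after each generation step and immediately admits waiting requests into the freed slots.

```python
# Toy illustration of in-flight (continuous) batching, not TensorRT-LLM internals:
# finished sequences leave the batch after any generation step, and queued
# requests are admitted immediately, so a long summarization job no longer
# blocks short chat requests behind it.
from collections import deque

# Each request is (name, tokens still to generate); the numbers are made up.
queue = deque([("chat-1", 3), ("summarize-long-doc", 40), ("chat-2", 2), ("chat-3", 4)])
MAX_SLOTS = 2           # hypothetical per-GPU batch capacity
active = {}             # requests currently being decoded
step = 0

while queue or active:
    # Admit waiting requests into any free slots (the "in-flight" part).
    while queue and len(active) < MAX_SLOTS:
        name, remaining = queue.popleft()
        active[name] = remaining

    # One decoding step: every active request produces one token.
    step += 1
    for name in list(active):
        active[name] -= 1
        if active[name] == 0:       # finished sequences are evicted mid-batch...
            del active[name]        # ...freeing their slot for the next request
            print(f"step {step:2d}: {name} finished")
```

In this sketch, the short chat requests cycle through the two slots and complete while the long summarization job keeps generating, rather than the whole queue stalling behind it.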
Nvidia has been working with numerous companies to optimize TensorRT-LLM, including Meta, Cohere Inc., Grammarly Inc., Databricks Inc. and Tabnine Ltd. With their assistance, Nvidia has been streamlining the suite’s capabilities and toolset, including an open-source Python application programming interface for defining and optimizing new architectures to customize LLMs.
For example, MosaicML added extra features on top of TensorRT-LLM when integrating it into its existing software stack. Naveen Rao, vice president of engineering at Databricks, said it was a simple process.
“TensorRT-LLM is easy to use, feature-packed with streaming of tokens, in-flight batching, paged-attention, quantization and more, and is efficient,” Rao said. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”
Nvidia claims that implementing TensorRT-LLM, including its in-flight batching, can more than double inference performance for article summarization on the Nvidia H100. In tests against the A100 using the GPT-J-6B model for CNN/Daily Mail article summarization, the H100 alone was four times faster than the A100, and eight times faster once TensorRT-LLM’s optimizations were enabled.
TensorRT-LLM gives developers and engineers a deep learning compiler, optimized LLM kernels, pre- and post-processing steps and multi-GPU/multi-node communication capabilities behind a simple open-source API, so they can quickly optimize and execute LLMs for inference in production. As LLMs continue to reshape data centers, the higher performance demanded by enterprise use means that developers need tools, more than ever, that give them the features and access to deliver more performant results.
The TensorRT-LLM software suite is now available in early access to developers in the Nvidia developer program. Next month it will be integrated into the NeMo framework, part of Nvidia AI Enterprise, the company’s end-to-end software platform for production AI.
Image: Nvidia