UPDATED 09:00 EDT / JULY 20 2021

Nvidia boosts AI inference performance with TensorRT 8

Nvidia Corp. is speeding up artificial intelligence inference with the launch of the next generation of its TensorRT software today.

TensorRT 8 is the eighth iteration of Nvidia’s popular AI software that’s used for high-performance deep learning inference. The software combines a powerful deep learning optimizer with a runtime that delivers low-latency, high-throughput inference for a range of AI applications.
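In a typical workflow, developers import a trained model (often via ONNX), let TensorRT's builder optimize and compile it into an engine, and then execute that engine with the runtime. Below is a minimal sketch using the TensorRT Python API; the file names are hypothetical placeholders:

```python
import tensorrt as trt

# The builder applies TensorRT's optimizations (layer fusion,
# precision selection, kernel auto-tuning) when compiling an engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Import a trained model exported to ONNX ("model.onnx" is a placeholder).
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

# Allow reduced precision for extra throughput, then build the engine.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(engine)
```

The serialized engine is later deserialized by the TensorRT runtime and executed with low latency in production.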

Inference is an important aspect of AI. Whereas training refers to developing an algorithm’s ability to understand a dataset, inference refers to its ability to act on that understanding and answer specific queries.

If it’s going to be useful in the real world, AI needs to be able to infer quickly. And that becomes all the more important as applications become more complex and deal with ever-growing amounts of data.

Nvidia said in a blog post today that TensorRT 8 is able to slash inference time in half compared with the previous iteration of the software, meaning it can be used to develop high-performing search engines, ad recommendation systems and chatbots that can be deployed in the cloud or at the network edge.

That’s thanks to transformer optimizations in TensorRT 8, which Nvidia said deliver “record-setting speed for language applications.” The new software can, for example, run BERT-Large, one of the world’s most widely used transformer-based models, in just 1.2 milliseconds, Nvidia said. Previously, AI researchers had to shrink their models to run BERT-Large at that speed, sacrificing accuracy in the process. With TensorRT 8, it’s possible to double or triple an AI model’s size and still achieve dramatic improvements in accuracy, the company claimed.

Greg Estes, a vice president of developer programs at Nvidia, said AI models are growing exponentially more complex. At the same time, worldwide demand is surging for real-time applications that use AI. “The latest version of TensorRT introduces new capabilities that enable companies to deliver conversational AI applications to customers with a level of quality and responsiveness that was never before possible,” he said.

The new TensorRT release brings two other key features that also speed up AI inference performance. The first is a technique called sparsity, which takes advantage of efficiency gains in Nvidia Ampere graphics processing units so developers can accelerate neural networks by reducing the computational operations those chips perform.
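Under the hood, this relies on the fine-grained 2:4 structured-sparsity pattern supported by Ampere’s sparse tensor cores, in which two of every four consecutive weights are zero. As a rough illustration of the pruning pattern only, not Nvidia’s implementation, a minimal NumPy sketch might look like this:

```python
import numpy as np

def prune_2_to_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four
    weights, yielding the 2:4 structured-sparse pattern that Ampere
    sparse tensor cores can skip at inference time."""
    groups = weights.reshape(-1, 4).copy()
    # Indices of the two smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(weights.shape)

w = np.random.randn(8, 8).astype(np.float32)
w_sparse = prune_2_to_4(w)
# Every group of four now contains at least two zeros.
assert (w_sparse.reshape(-1, 4) == 0).sum(axis=1).min() >= 2
```

In practice, a pruned model is typically fine-tuned afterward to recover any lost accuracy before TensorRT exploits the pattern at inference time.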

Also new is quantization-aware training, which allows developers to use trained models to run inference in INT8 precision without losing accuracy. This significantly reduces compute and storage overhead, the company said, enabling tensor cores to work more efficiently.
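Conceptually, quantization-aware training inserts a “fake quantization” step into the forward pass during training, so the model learns weights that tolerate INT8 rounding error. Here is a minimal sketch of that core idea, as a generic illustration rather than Nvidia’s exact implementation:

```python
import numpy as np

def fake_quantize_int8(x: np.ndarray) -> np.ndarray:
    """Quantize to INT8 and immediately dequantize. Placed in the
    forward pass during training, this exposes the model to INT8
    rounding error while the computation stays in floating point."""
    # Symmetric per-tensor scale; epsilon guards against all-zero input.
    scale = max(np.abs(x).max() / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127)  # simulated INT8 values
    return (q * scale).astype(np.float32)        # dequantize, keeping the error

activations = np.random.randn(4, 16).astype(np.float32)
simulated = fake_quantize_int8(activations)
print("max rounding error:", np.abs(activations - simulated).max())
```

Because the model is trained against this error, it can later run genuinely in INT8, where the smaller data type cuts compute and memory traffic.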

Holger Mueller of Constellation Research Inc. told SiliconANGLE that when it comes to AI, software such as TensorRT is just as important as the hardware it runs on. With TensorRT being one of the most popular inference software platforms around, the new release is therefore a big deal, he said.

“The key improvements are all around inference speed,” Mueller said. “The most interesting feature looks to be the new sparsity technique that improves efficiency. In a few months we’ll likely have some real-world customer proof points that illustrate TensorRT 8’s performance gains.”

The well-known AI startup Hugging Face Inc., which created the open-source Transformers library of natural language neural networks, said it has already been using TensorRT 8 to create new services that enable text analysis, neural search and conversational applications at scale.

“We’re closely collaborating with Nvidia to deliver the best possible performance for state-of-the-art models on Nvidia GPUs,” said Hugging Face Product Director Jeff Boudier. “The Hugging Face Accelerated Inference API already delivers up to 100x speedup for transformer models powered by NVIDIA GPUs. With TensorRT 8, Hugging Face achieved 1ms inference latency on BERT, and we’re excited to offer this performance to our customers later this year.”

Nvidia said TensorRT 8 is generally available now and will be free to all members of the Nvidia Developer Program. New versions of TensorRT 8’s plug-ins, parsers and samples are available through an open-source license via the TensorRT GitHub repository.

Image: Nvidia
