UPDATED 09:00 EDT / AUGUST 31 2021

AI

Nvidia showcases latest research and advances in speech synthesis

Nvidia Corp. is closing the gap between synthesized speech and human voices with the launch of its most advanced conversational artificial intelligence models at the Interspeech 2021 conference today.

In a blog post, Nvidia science and AI writer Isha Salian explained that the company has come a long way in its attempts at using AI to create synthetic speech that’s indistinguishable from humans. “AI has transformed synthesized speech from the monotone of robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers,” she noted.

That said, a gap still persists as it’s very difficult to emulate the complex rhythm, intonation and timbre found in human speech. But Nvidia says it’s getting much closer to bridging that gap and is demonstrating the progress it’s made for all to see, inviting developers to build on its work.

Salian said the company’s advances are best illustrated by controllable speech synthesis models like RAD-TTS, which was employed in Nvidia’s winning demo at the SIGGRAPH Real-Time Live competition earlier this month. By training a text-to-speech model on audio of a person’s speech, developers can have the model convert any new piece of text into that person’s voice.

In addition, the RAD-TTS model is capable of voice conversion, where one speaker’s words can be delivered in another person’s voice. It can even do that when the person is singing instead of talking in a normal voice.

“Inspired by the idea of the human voice as a musical instrument, the RAD-TTS interface gives users fine-grained, frame-level control over the synthesized voice’s pitch, duration and energy,” Salian wrote. That makes it possible to do some pretty unique things, such as substituting a male speaker’s voice with words from a female narrator.

Bryan Catanzaro, vice president of applied deep learning research at Nvidia, said in a press briefing that speech research is a strategic area for the company because it has literally dozens of potential applications, ranging from live captioning in video conferences to medical transcriptions, chatbots with speech interfaces and more. “We feel like it’s a good time to make these technologies more useful,” he said.

Salian said Nvidia is making many of its advances available to the open-source community through its newly launched Nvidia NeMo toolkit on its NGC hub of AI software.

Nvidia NeMo is an open-source Python toolkit for graphics processing unit-accelerated conversational AI, meant to help researchers and developers create, experiment with and fine-tune speech models for different applications. The kit includes easy-to-use application programming interfaces and pre-trained models that researchers can customize for text-to-speech, natural language processing and real-time automatic speech recognition.
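As a rough illustration of how the toolkit is meant to be used, the sketch below loads a pre-trained spectrogram generator and vocoder and synthesizes speech from a line of text. The specific model names ("tts_en_fastpitch", "tts_hifigan") and the 22,050 Hz sample rate are assumptions that may vary by NeMo release, not details confirmed in Nvidia's announcement.

```python
# Minimal NeMo text-to-speech sketch: text -> mel spectrogram -> waveform.
# Assumes a NeMo 1.x install; model names and sample rate may differ by release.
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")  # text to spectrogram
vocoder = HifiGanModel.from_pretrained("tts_hifigan")                # spectrogram to audio

tokens = spec_generator.parse("Synthesized speech is getting closer to the human voice.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the generated waveform to disk.
sf.write("synthesized.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```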

Some of those models have been trained on tens of thousands of hours of audio data using Nvidia’s GPU systems. Developers can now take those models and fine-tune them for a range of use cases.
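One hypothetical way to build on such a checkpoint, sketched below, is to load a pre-trained speech recognition model and continue training it on a custom dataset. The model name, manifest path and hyperparameters here are illustrative placeholders rather than values documented by Nvidia.

```python
# Hypothetical fine-tuning sketch: start from a pre-trained NeMo ASR checkpoint
# and continue training on domain-specific audio listed in a JSON manifest.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Load a pre-trained English speech recognition model (name is an example).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

# Point the model at custom training data; the manifest pairs audio files with transcripts.
asr_model.setup_training_data(train_data_config=OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": list(asr_model.decoder.vocabulary),
    "batch_size": 16,
}))

# Fine-tune for a few epochs on a single GPU.
trainer = pl.Trainer(gpus=1, max_epochs=5)
trainer.fit(asr_model)
```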

Salian said possible applications go beyond the simple voiceover work shown in the demo, such as aiding people with vocal disabilities or helping people translate between languages in their own voice. The AI models can even be used to recreate the performances of iconic singers, matching not just the melody of a song but also the emotional expression of the vocals.

Besides making the Nvidia NeMo models available, Nvidia researchers are holding various talks at Interspeech to showcase the company’s advances in speech synthesis.

With reporting from Robert Hof

Image: Nvidia
