UPDATED 09:00 EDT / AUGUST 31 2021

AI

Nvidia showcases latest research and advances in speech synthesis

Nvidia Corp. is closing the gap between synthesized speech and human voices with the launch of its most advanced conversational artificial intelligence models at the Interspeech 2021 conference today.

In a blog post, Nvidia science and AI writer Isha Salian explained that the company has come a long way in its attempts at using AI to create synthetic speech that’s indistinguishable from humans. “AI has transformed synthesized speech from the monotone of robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers,” she noted.

That said, a gap still persists as it’s very difficult to emulate the complex rhythm, intonation and timbre found in human speech. But Nvidia says it’s getting much closer to bridging that gap and is demonstrating the progress it’s made for all to see, inviting developers to build on its work.

Salian said the company’s advances are best illustrated by controllable speech synthesis models like RAD-TTS, which was employed in Nvidia’s winning demo at the SIGGRAPH Real-Time Live competition earlier this month. By training a text-to-speech model on audio of a person’s speech, developers can have the model convert any new piece of text into that person’s voice.

In addition, the RAD-TTS model is capable of voice conversion, where one speaker’s words can be delivered in another person’s voice. It can even do that when the person is singing instead of talking in a normal voice.

“Inspired by the idea of the human voice as a musical instrument, the RAD-TTS interface gives users fine-grained, frame-level control over the synthesized voice’s pitch, duration and energy,” Salian wrote. That makes it possible to do some pretty unique things, such as substituting a male speaker’s voice with words from a female narrator.

Bryan Catanzaro, vice president of applied deep learning research at Nvidia, said in a press briefing that speech research is a strategic area for the company because it has literally dozens of potential applications, ranging from live captioning in video conferences to medical transcriptions, chatbots with speech interfaces and more. “We feel like it’s a good time to make these technologies more useful,” he said.

Salian said Nvidia is making many of its advances available to the open-source community through its newly launched Nvidia NeMo toolkit on its NGC hub of AI software.

Nvidia NeMo is an open-source Python toolkit for graphics processing unit-accelerated conversational AI, meant to help researchers and developers create, experiment with and fine-tune speech models for different applications. The kit includes easy-to-use application programming interfaces and pre-trained models that researchers can customize for text-to-speech, natural language processing and real-time automatic speech recognition.
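As a rough illustration of how the toolkit is meant to be used, the sketch below loads a pre-trained spectrogram generator and vocoder and synthesizes speech from a line of text. The specific model names ("tts_en_fastpitch", "tts_hifigan") and the 22,050 Hz sample rate are assumptions that may vary by NeMo release, not details confirmed in Nvidia's announcement.

```python
# Minimal NeMo text-to-speech sketch: text -> mel spectrogram -> waveform.
# Assumes a NeMo 1.x install; model names and sample rate may differ by release.
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")  # text to spectrogram
vocoder = HifiGanModel.from_pretrained("tts_hifigan")                # spectrogram to audio

tokens = spec_generator.parse("Synthesized speech is getting closer to the human voice.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the generated waveform to disk.
sf.write("synthesized.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```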

Some of those models have been trained on tens of thousands of hours of audio data using Nvidia’s GPU systems. Developers can now take those models and fine-tune them for a range of use cases.
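One hypothetical way to build on such a checkpoint, sketched below, is to load a pre-trained speech recognition model and continue training it on a custom dataset. The model name, manifest path and hyperparameters here are illustrative placeholders rather than values documented by Nvidia.

```python
# Hypothetical fine-tuning sketch: start from a pre-trained NeMo ASR checkpoint
# and continue training on domain-specific audio listed in a JSON manifest.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Load a pre-trained English speech recognition model (name is an example).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

# Point the model at custom training data; the manifest pairs audio files with transcripts.
asr_model.setup_training_data(train_data_config=OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": list(asr_model.decoder.vocabulary),
    "batch_size": 16,
}))

# Fine-tune for a few epochs on a single GPU.
trainer = pl.Trainer(gpus=1, max_epochs=5)
trainer.fit(asr_model)
```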

Salian said possible applications go beyond the simple voiceover work shown in the demo, such as aiding people with vocal disabilities or helping people translate between languages in their own voice. The AI models can even be used to recreate the performances of iconic singers, matching not just the melody of a song but also the emotional expression of the vocals.

Besides making the Nvidia NeMo models available, Nvidia researchers are holding various talks at Interspeech to showcase the company’s advances in speech synthesis.

With reporting from Robert Hof

Image: Nvidia
