UPDATED 21:37 EST / JANUARY 10 2023

AI

Microsoft unveils VALL-E, a text-to-speech AI that can mimic a voice from seconds of audio

Microsoft Corp. today provided a peek at a text-to-speech artificial intelligence tool that can apparently simulate a person's voice after listening to just three seconds of sample audio.

The company says the tool, called VALL-E, preserves the speaker's emotional tone throughout the synthesized message and even reproduces the acoustic environment of the original recording. Microsoft claims that doing all of this from such a short sample is unprecedented, and that no other AI model sounds as natural.

Voice simulation is nothing new, and it has not always been used for the best of reasons. The concern is that as this kind of AI improves, audio deepfakes become more convincing, and that could become a real problem.

At the moment it's impossible to know just how good VALL-E is, since Microsoft has not released the tool to the public, though it has published demo samples. If the mimicry really requires only three seconds of audio, and the cloned voice can then go on speaking for any length of time, the results are frankly very impressive.

If VALL-E is as good as Microsoft says and can quickly sound as human as a human, charisma and all, it's easy to see why the company wants to invest heavily in OpenAI LLC's ChatGPT, the AI that has just taken the world by storm. Combine the two and callers to a call center might no longer be able to tell whether they're speaking to a person or a robot. Together, the tools might also produce what sounds like a podcast, except the guest isn't real.

A powerful tool that can convincingly mimic someone's voice from just a few seconds of audio is concerning. In the wrong hands, it could be used to spread misinformation by mimicking the voices of politicians, journalists or celebrities. Microsoft appears to be well aware of the potential for misuse.

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” Microsoft said at the conclusion of the paper. “To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
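Microsoft's paper doesn't describe what such a detector would look like, but in broad terms it would be a binary classifier trained to separate genuine recordings from synthesized ones. The sketch below illustrates that general idea only; the features, placeholder data and labels are invented for illustration and have nothing to do with Microsoft's actual approach.

```python
# Hypothetical sketch of a synthesized-speech detector: a binary classifier
# trained on "real" vs. "synthesized" audio. All data and features here are
# stand-ins for illustration, not anything described in Microsoft's paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def spectral_features(waveform: np.ndarray, n_bands: int = 32) -> np.ndarray:
    """Summarize a 1-D waveform as average log-energy in coarse frequency bands."""
    spectrum = np.abs(np.fft.rfft(waveform)) ** 2
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([b.mean() for b in bands]))

# Placeholder dataset: random clips standing in for labeled real/synthesized audio.
rng = np.random.default_rng(0)
real_clips = [rng.normal(size=16000) for _ in range(200)]
fake_clips = [rng.normal(size=16000) * np.linspace(1.0, 0.5, 16000) for _ in range(200)]

X = np.array([spectral_features(c) for c in real_clips + fake_clips])
y = np.array([0] * len(real_clips) + [1] * len(fake_clips))  # 1 = synthesized

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

In practice, a real detector would be trained on labeled recordings and outputs of the speech model itself rather than toy features like these.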

Photo: Volodymyr Hryshchenko/Unsplash
