UPDATED 21:37 EDT / JANUARY 10 2023

AI

Microsoft unveils VALL-E, a text-to-speech AI that can mimic a voice from seconds of audio

Microsoft Corp. today provided a peek at a text-to-speech artificial intelligence tool that can apparently simulate a voice after listening to just three seconds of an audio sample.

The company said its tool, VALL-E, can preserve the speaker's emotional tone for the rest of the message while also simulating the acoustics of the room in which the original recording was made. Not only can it do this from a sample just a few seconds long, something unheard-of until now, but Microsoft says no other AI model sounds as natural.

Voice simulation is nothing new, and it has not always been put to the best of uses. The concern is that as such AI improves, audio deepfakes become more convincing, and that is where the problems begin.

At the moment, it’s impossible to know just how good VALL-E is, since Microsoft has not released the tool to the public, though it has provided samples of its output. If the mimicry really requires only three seconds of audio and the cloned voice can then go on speaking for any length of time, the result is frankly very impressive.

If it’s as good as Microsoft says and can quickly sound as human as a human, charisma and all, it’s easy to see why Microsoft wants to invest heavily in OpenAI LLC, whose ChatGPT has just taken the world by storm. Combine the two and callers to a call center might no longer be able to tell whether they are talking to a person or a robot, and the paired tools might even produce what sounds like a podcast in which the guest is not real.

A powerful tool that can perfectly mimic someone’s voice after just a few seconds is concerning. In the hands of the wrong people, it could be used to spread misinformation, mimicking the voices of politicians, journalists or celebrities. It seems Microsoft is well aware of the potential for misuse.

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” Microsoft said at the conclusion of the paper. “To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
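The mitigation Microsoft sketches, a detection model that decides whether a clip is real or synthesized, is in essence a binary audio classifier. Below is a minimal, purely illustrative sketch in Python of what such a detector could look like. It is not from the VALL-E paper: the tiny convolutional architecture, the log-mel features and the SyntheticSpeechDetector name are all assumptions made here for illustration.

```python
# Illustrative sketch only: a small binary classifier that scores an audio clip
# as real or synthesized, in the spirit of the detection model Microsoft
# describes. Nothing here comes from the VALL-E paper; architecture and
# feature choices are hypothetical.
import torch
import torch.nn as nn
import torchaudio


class SyntheticSpeechDetector(nn.Module):
    """Tiny CNN over log-mel spectrograms; outputs P(clip is synthesized)."""

    def __init__(self, n_mels: int = 64):
        super().__init__()
        # Convert raw 16 kHz audio into a mel spectrogram.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels)
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) mono audio at 16 kHz
        mels = self.melspec(waveform).unsqueeze(1)        # (batch, 1, mels, frames)
        mels = torch.log(mels + 1e-6)                     # log compression
        return torch.sigmoid(self.net(mels)).squeeze(-1)  # (batch,) probabilities


if __name__ == "__main__":
    detector = SyntheticSpeechDetector()
    clip = torch.randn(1, 16000 * 3)  # stand-in for a three-second clip
    print(detector(clip))             # untrained, so roughly 0.5
```

In practice such a detector would have to be trained on pairs of genuine recordings and VALL-E outputs, data that for now only Microsoft possesses, which is presumably why the paper frames the detection model as something Microsoft itself could build.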

Photo: Volodymyr Hryshchenko/Unsplash
