Facebook AI researchers have cloned Bill Gates’ voice with uncanny accuracy
Researchers at Facebook Inc. have managed to clone Microsoft Corp. co-founder Bill Gates’ voice so convincingly that it’s difficult to tell it’s machine-generated speech.
Sean Vasquez and Mike Lewis at Facebook AI Research said Monday they’ve been working on mimicking human speech for some time, a notoriously difficult task: even the most famous speech synthesizer of all, the one used by Stephen Hawking, still sounded very much like a machine.
It seems progress has now been made, and if you listen to the clone of Gates (pictured), you’ll likely agree. It sounds like him, and you’d be hard-pressed to tell the machine apart from his real voice.
In one sample the machine says, as Gates, “The glow deepened in the eyes of the sweet girl.” In another it speaks the words “Write a fond note to the friend you cherish.” What’s perhaps most uncanny about the latter is how the machine captures Gates’ unmistakable rising inflection on “cherish.”
The technology behind this, called MelNet, can copy human intonation, and Gates’ voice is only one of many it has reproduced with this fidelity. The audio used to clone the voices was taken from various TED Talks, Vasquez and Lewis said.
The researchers said that until recently, text-to-speech software hasn’t worked very well because models were trained on waveform recordings, which show how a sound’s amplitude changes from moment to moment; a single second of audio contains tens of thousands of timesteps. If you hear the word “cherish” uttered by Gates, the tone shifts quite a lot, and a deep-learning system trying to mimic a person must predict all of these small shifts, no easy task.
Vasquez and Lewis said they managed to clone voices far more accurately by instead training the model on spectrograms, which plot a sound’s frequency content over time.
“The temporal axis of a spectrogram is orders of magnitude more compact than that of a waveform, meaning dependencies that span tens of thousands of timesteps in waveforms only span hundreds of timesteps in spectrograms,” said the researchers. “This enables our spectrogram models to generate unconditional speech and music samples with consistency over multiple seconds.”
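To make those numbers concrete, here is a minimal sketch of the compaction the researchers describe, assuming the open-source librosa audio library (our choice for illustration, not the researchers’ tooling); the file name and frame parameters are hypothetical:

```python
# Illustrative only: how much the temporal axis shrinks when a
# one-second waveform is converted to a mel spectrogram.
import librosa

# Load one second of speech at a typical 22,050 Hz sampling rate.
y, sr = librosa.load("gates_clip.wav", sr=22050, duration=1.0)
print(y.shape)  # (22050,) -- tens of thousands of waveform timesteps

# Convert to a mel spectrogram; each frame summarizes hop_length samples.
S = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
print(S.shape)  # (80, 87) -- only hundreds of frames along the time axis

# A generative model over S steps through ~87 frames per second instead
# of 22,050 samples, so long-range dependencies span far fewer timesteps.
```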
There are limitations, though. The team said that although the model can reproduce a sentence almost perfectly, it can’t yet replicate “intonation to indicate changes in topic or mood as stories evolve over tens of seconds or minutes.” Still, when it comes to human-computer interaction, the team said, the technology could be transformative for conversations that involve only short phrases.
Photo: Gisela Giardino/Flickr