DeepMind’s WaveNet uses neural nets to help computers talk to us
Computers are getting better and better at understanding human speech thanks to powerful data tools like deep learning and neural networks. Now, Google parent company Alphabet Inc.’s DeepMind unit is applying the same tools to the opposite problem: getting computers to talk to people.
More specifically, DeepMind is working on WaveNet, an advanced text-to-speech synthesis tool that uses neural networks to determine the right combinations of sounds required to create individual spoken words. According to DeepMind, this differs from the method used by most other text-to-speech programs, which rely on databases of pre-recorded sounds that are cut and pasted together to form words. That cut-and-paste approach is what makes many speech programs sound somewhat cold and robotic, much like Texas Instruments’ old Speak and Spell toys from the 1980s.
This cut-and-paste approach to speech synthesis has been used by a wide variety of text-to-speech software over the years, including intelligent assistants like Apple’s Siri and Microsoft’s Cortana. It has also been used in interesting projects like Yamaha Corp.’s Vocaloid software, a Japanese music creation program that allows users to change the pitch and rhythm of synthesized speech to create songs with artificial singers. The most famous of these is Hatsune Miku, a completely artificial pop star who recently toured the US as a holographic display.
DeepMind explained that this method limits the possibilities of text-to-speech, but with neural networks and WaveNet, a greater variety of sounds and voices is possible.
“Generating speech with computers — a process usually referred to as speech synthesis or text-to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances,” the DeepMind team explained in a blog post. “This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.”
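The concatenative approach DeepMind describes can be illustrated with a minimal sketch. This is not code from any real TTS system: the fragment labels, the sample values and the `synthesize` helper are all hypothetical, chosen only to show why the voice is tied to one recorded database.

```python
# Hypothetical fragment database: unit label -> recorded samples for ONE speaker.
# Real systems store a very large number of fragments; these values are made up.
FRAGMENTS_SPEAKER_A = {
    "HH": [0.0, 0.1, 0.2],
    "EH": [0.3, 0.4],
    "L": [0.2, 0.1],
    "OW": [0.0, -0.1, -0.2],
}


def synthesize(units, database):
    """Concatenate the stored fragment for each unit, in order."""
    waveform = []
    for unit in units:
        if unit not in database:
            # Nothing can be produced for a sound that was never recorded.
            raise KeyError(f"no recording for unit {unit!r} in this database")
        waveform.extend(database[unit])
    return waveform


hello = synthesize(["HH", "EH", "L", "OW"], FRAGMENTS_SPEAKER_A)
print(len(hello))  # 10 samples: 3 + 2 + 2 + 3
```

Because every output sample comes straight from the recordings, switching to a different speaker or a different emotional tone in this sketch would mean supplying an entirely new dictionary, which mirrors the limitation DeepMind points out.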
What makes WaveNet special is its ability to create the raw waveform of an audio signal “one sample at a time,” meaning that it produces a new audio value up to tens of thousands of times for every second of speech (16,000 times per second for the 16kHz audio DeepMind describes).
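To make “one sample at a time” concrete, here is a rough sketch of a generation loop in which each new sample is predicted from the samples already produced. The `toy_next_sample` function is a hypothetical stand-in (a simple damped resonator with a little noise), not WaveNet’s neural network; only the 16kHz sample rate comes from DeepMind’s description.

```python
import math
import random

SAMPLE_RATE = 16_000  # samples per second, matching the 16kHz audio DeepMind mentions


def toy_next_sample(history, rng):
    """Hypothetical stand-in for a trained network: predicts the next sample
    from the two previous ones (a damped resonator plus a little noise)."""
    prev1 = history[-1] if len(history) >= 1 else 0.0
    prev2 = history[-2] if len(history) >= 2 else 0.0
    omega = 2 * math.pi * 220 / SAMPLE_RATE  # 220 Hz tone, an arbitrary choice
    predicted = 2 * 0.999 * math.cos(omega) * prev1 - 0.999 ** 2 * prev2
    return predicted + rng.gauss(0.0, 0.01)


def generate(duration_seconds, seed=0):
    """Build a waveform one sample at a time, feeding each output back in."""
    rng = random.Random(seed)
    samples = []
    for _ in range(int(duration_seconds * SAMPLE_RATE)):
        samples.append(toy_next_sample(samples, rng))
    return samples


waveform = generate(0.5)
print(len(waveform))  # 8000 samples for half a second at 16 kHz
```

Even in this toy version, the loop runs 16,000 times per second of audio, which gives a sense of why DeepMind calls it surprising that the approach works at all.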
As with all deep learning tools, DeepMind first fed WaveNet a large chunk of real-world data in the form of recorded voices. The artificial intelligence learned from those sounds and created its own models for how to form words.
Does this mean no more robot voices?
You can listen to a few samples of WaveNet in action on DeepMind’s blog with examples in both English and Mandarin Chinese. So far the results are impressive, sounding almost identical to real human speech, albeit with a slightly mechanical tone in some examples.
The ability of a computer to speak realistically is only one component of a much larger movement to create artificial intelligences that can interact in an authentic, believable way. The implications of this movement stretch across multiple industries, affecting everything from customer service to virtual assistants and beyond.
DeepMind even noted that WaveNet, like Yamaha’s Vocaloid, can be used to create music, but rather than mimicking human singing, DeepMind showcased WaveNet’s ability to simulate the sounds of a piano.
“WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general,” the DeepMind team said. “The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.”
You can read DeepMind’s extremely technical paper on how WaveNet works here.
Top photo by tehusagent
Miku photo by BenceVocafan (Own work) [CC BY-SA 4.0], via Wikimedia Commons