UPDATED 19:02 EST / SEPTEMBER 08 2016

DeepMind’s WaveNet uses neural nets to help computers talk to us

Computers are getting better and better at understanding human speech thanks to powerful data tools like deep learning and neural networks. Now, Google parent company Alphabet Inc.’s DeepMind unit is applying the same tools to the opposite problem: getting computers to talk to people.

More specifically, DeepMind is working on WaveNet, an advanced text-to-speech synthesis tool that uses neural networks to determine the right combinations of sounds required to create individual spoken words. According to DeepMind, this is a very different method from the one most other text-to-speech programs use, which relies on databases of pre-recorded sounds that are cut and pasted together to form words. That cut-and-paste approach is what makes many speech programs sound somewhat cold and robotic, much like Texas Instruments’ old Speak & Spell toys from the 1980s.

This cut-and-paste process of speech synthesis has been used by a wide variety of text-to-speech software over the years, including intelligent assistants like Apple’s Siri and Microsoft’s Cortana. It has also been used in interesting projects like Yamaha Corp.’s Vocaloid software, a Japanese music creation program that allows users to change the pitch and rhythm of synthesized speech to create songs with artificial singers. The most famous of these is Hatsune Miku, a completely artificial pop star who recently toured the U.S. as a holographic display.

Artificial Japanese pop star Hatsune Miku

DeepMind explained that relying on pre-recorded fragments limits the possibilities of text-to-speech, but with neural networks and WaveNet, a greater variety of sounds and voices is possible.

“Generating speech with computers — a process usually referred to as speech synthesis or text-to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances,” the DeepMind team explained in a blog post. “This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.”
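The concatenative approach described in that quote can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `FRAGMENTS` table, its unit names and its sample values are hypothetical, and real systems store far larger databases and smooth the joins between fragments.

```python
# Hypothetical fragment database: each speech unit maps to samples
# pre-recorded from a single speaker (values are made up).
FRAGMENTS = {
    "he": [0.1, 0.3, 0.2],
    "llo": [0.0, -0.2, -0.1, 0.1],
}


def concatenative_tts(units, database):
    """Build a waveform by pasting stored fragments end to end."""
    waveform = []
    for unit in units:
        if unit not in database:
            # No recording exists for this unit, so it cannot be spoken.
            raise KeyError(f"no recorded fragment for unit {unit!r}")
        waveform.extend(database[unit])
    return waveform


hello = concatenative_tts(["he", "llo"], FRAGMENTS)
print(len(hello))  # 7 samples: 3 from "he" plus 4 from "llo"
```

Because the output can only ever contain what is in the database, changing the speaker or the emotion means recording a new table, which is the limitation DeepMind points to.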

What makes WaveNet special is its ability to create the raw waveform of an audio signal “one sample at a time,” meaning that it generates each point of the sound wave individually, at a rate of 16,000 samples per second for the 16kHz audio DeepMind describes.
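To give a feel for what generating audio “one sample at a time” involves, here is a rough sketch of the loop, not DeepMind’s model. The `toy_predictor` function is a hypothetical stand-in for the trained neural network: it is a simple two-term recurrence that happens to produce a 220Hz sine wave, chosen only so the example runs without any training data. The 16,000-samples-per-second rate comes from the 16kHz figure DeepMind quotes.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (16kHz audio)


def toy_predictor(history, freq_hz=220.0):
    """Hypothetical stand-in for the network: next sample from the last two."""
    omega = 2 * math.pi * freq_hz / SAMPLE_RATE
    # Standard sine recurrence: x[n] = 2*cos(omega)*x[n-1] - x[n-2]
    return 2 * math.cos(omega) * history[-1] - history[-2]


def generate(num_samples, predict_next, seed):
    """Grow the waveform one sample at a time, feeding each output back in."""
    samples = list(seed)
    while len(samples) < num_samples:
        samples.append(predict_next(samples))
    return samples[:num_samples]


omega = 2 * math.pi * 220.0 / SAMPLE_RATE
one_second = generate(SAMPLE_RATE, toy_predictor, seed=[0.0, math.sin(omega)])
print(len(one_second))  # 16000 predictions for a single second of audio
```

The point of the sketch is the shape of the loop: every new sample depends on the ones already generated, so one second of sound takes 16,000 separate predictions.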

As with all deep learning tools, DeepMind first fed WaveNet a large chunk of real-world data in the form of recorded voices. The artificial intelligence learned from those sounds and created its own models for how to form words.

Does this mean no more robot voices?

You can listen to a few samples of WaveNet in action on DeepMind’s blog with examples in both English and Mandarin Chinese. So far the results are impressive, sounding almost identical to real human speech, albeit with a slightly mechanical tone in some examples.

The ability of a computer to speak realistically is only one component of a much larger movement to create artificial intelligences that can interact in an authentic, believable way. The implications of this movement stretch across multiple industries, affecting everything from customer service to virtual assistants and beyond.

DeepMind even noted that WaveNet, like Yamaha’s Vocaloid, can be used to create music, but rather than mimicking human singing, DeepMind showcased WaveNet’s ability to simulate the sounds of a piano.

“WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general,” the DeepMind team said. “The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.”

You can read DeepMind’s extremely technical paper on how WaveNet works here.

Top photo by tehusagent
Miku photo by BenceVocafan (Own work) [CC BY-SA 4.0], via Wikimedia Commons

