UPDATED 19:11 EST / MAY 25 2017

EMERGING TECH

Baidu’s text-to-speech AI can replicate hundreds of accents

Virtual assistants give our smart devices seemingly human personalities, but despite efforts to make them sound like real people, programs such as Apple Inc.’s Siri and Microsoft Corp.’s Cortana still sometimes sound a little robotic. Chinese web giant Baidu Inc. aims to change that with Deep Voice, an artificial intelligence system designed to convert text into believable human speech.

Today, Baidu announced the release of Deep Voice 2, the second iteration of its text-to-speech (TTS) AI, which uses deep learning to replicate human speech. According to the company, in just three months the system has scaled from 20 hours of speech in a single voice to hundreds of hours of speech across hundreds of different voices.

Baidu said that unlike similar TTS neural nets, Deep Voice 2 generates speech in real time, “as fast as it needs to be played.” The company also boasted that the AI can learn from relatively short recordings of many different voice sources. In a paper outlining the methodology behind Deep Voice 2, Baidu explained that this is a major breakthrough for TTS technology.

“Most TTS systems are built with a single speaker voice, and multiple speaker voices are provided by having distinct speech databases or model parameters,” the company said in its paper. “As a result, developing a TTS system with support for multiple voices requires much more data and development effort than a system which only supports a single voice.”

Baidu claims that Deep Voice 2 is 400 times faster than other TTS systems such as Google Inc.’s WaveNet, and the company believes that its AI could be a powerful tool for improving interactive media and conversational interfaces. Deep Voice 2’s ability to replicate accents could be especially valuable for companies looking to roll out voice interfaces in multiple regions, as it would simplify the process of localizing a product’s speech. The AI could also allow users to swap out the voices used in their apps, giving their smart devices more customizable personalities.

Several speech samples from Deep Voice 2 are available on Baidu’s website, showing a small portion of the many voices the AI can use.

Deep Voice 2 is one of many AI projects that Baidu has in the works. The company also developed a speech-to-text program called Deep Speech 2, which it used to launch its own “voice first” keyboard app last year. In September, the company announced a partnership with chip maker Nvidia Corp. to provide cloud-updated 3D maps for Nvidia’s self-driving car projects. Then in March, Baidu revealed that it would be opening a second AI research facility in Silicon Valley.

Photo: simone.brunozzi – https://www.flickr.com/photos/simone_brunozzi/4469421200/, CC BY-SA 2.0

