UPDATED 11:00 EDT / MARCH 27 2018


Almost human: Google offers text-to-speech technology on its cloud

For years, Google has offered the ability to convert text to speech on a number of its services, such as search, Maps and Google Assistant. Now, it’s offering the capability as a service in its cloud.

The company announced today that other companies can now try out Cloud Text-to-Speech in their own services. Google’s newest machine learning service is intended to help them build better conversational interfaces.

The service is aimed at three main markets, Dan Aharon, product manager for Cloud AI, said in an interview. The main one is voice response systems for call centers, for which Cloud Text-to-Speech can provide real-time, natural-language conversation. “We think this is going to be massively disruptive to the call center space,” he said, a somewhat more polite way of saying all those call center jobs that went to India and the Philippines may soon vanish themselves.

The other two are enabling devices in the “internet of things,” from cars to televisions to robots, to talk back to their users, and converting text such as news articles and books into speech, such as podcasts and audiobooks.

The service has 32 different voices in 12 languages and also allows application developers to customize voice pitch, speaking rate and volume gain. In a demonstration, all this made some snippets of speech from text sound very close to natural. Indeed, according to Google’s own tests, some came quite close to human speech.
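To give a sense of what those controls look like to a developer, here is a minimal sketch that builds the JSON body for a synthesis request without sending it. The field names (`languageCode`, `speakingRate`, `pitch`, `volumeGainDb`) follow the public REST reference for the service's `text:synthesize` method as best understood here, and should be checked against the current documentation. The helper name `build_synthesize_request` and its default values are hypothetical.

```python
import json


def build_synthesize_request(text, language_code="en-US", voice_name=None,
                             pitch=0.0, speaking_rate=1.0, volume_gain_db=0.0):
    """Build (but do not send) a request body for a text:synthesize call.

    Field names are assumed from the public REST reference; verify them
    against the current docs before relying on this.
    """
    voice = {"languageCode": language_code}
    if voice_name is not None:
        # A specific voice name is optional; omit it to accept a default voice.
        voice["name"] = voice_name
    return {
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {
            "audioEncoding": "MP3",
            "pitch": pitch,                  # offset in semitones from the default
            "speakingRate": speaking_rate,   # 1.0 is the voice's normal speed
            "volumeGainDb": volume_gain_db,  # 0.0 leaves the volume unchanged
        },
    }


if __name__ == "__main__":
    body = build_synthesize_request("Your order has shipped.",
                                    pitch=2.0, speaking_rate=0.9)
    print(json.dumps(body, indent=2))
```

Sending the body would additionally require credentials and an HTTP client, which are left out here to keep the sketch self-contained.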

Google is actually using several different text-to-speech technologies — the one it has used for years, as well as two versions from its DeepMind artificial intelligence unit that use WaveNet. Those two create raw audio waveforms from scratch rather than the traditional methods of combining actual voice samples into larger voice fragments or morphing them using transformative algorithms to make a wider variety of sounds.


The first version of WaveNet, published in late 2016, used a so-called generative model that’s trained on a large sample of real voices and learns the underlying structure of the speech, such as which tones follow others. DeepMind said text converted to speech this way sounds more natural, sometimes scoring above 4 on a scale on which human speech is rated about 4.5.

More recently, Google has started using an updated version of WaveNet running on Google’s Cloud Tensor Processing Unit infrastructure. It generates raw waveforms 1,000 times faster than the original model, producing a second of speech in only 50 milliseconds while offering higher fidelity. Aharon said this version gets 70 percent of the way toward sounding like human speech, though the demos sounded pretty close indeed.
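Those two speed figures can be put side by side with a little arithmetic. The sketch below uses only the numbers quoted above (50 milliseconds per second of speech, a 1,000-fold speedup). The implied figure for the original model is derived from them and is not a measurement reported by Google.

```python
# Figures quoted in the article.
NEW_MS_PER_AUDIO_SECOND = 50   # updated WaveNet: 50 ms to generate 1 s of speech
SPEEDUP_OVER_ORIGINAL = 1000   # "1,000 times faster than the original model"

# How many seconds of speech the updated model produces per second of compute.
real_time_factor = 1000 / NEW_MS_PER_AUDIO_SECOND

# Implied cost of the original model for the same second of speech (derived).
original_ms_per_audio_second = NEW_MS_PER_AUDIO_SECOND * SPEEDUP_OVER_ORIGINAL

if __name__ == "__main__":
    print(f"Updated model: {real_time_factor:.0f}x faster than real time")
    print(f"Original model (implied): {original_ms_per_audio_second / 1000:.0f} s "
          "of compute per second of speech")
```

In other words, the updated model runs about 20 times faster than real time, while the quoted speedup implies the original needed roughly 50 seconds to produce each second of audio, far too slow for a live conversation.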

“It’s the closest thing to human speech than we’ve seen before,” he said. Google will offer six WaveNet voices to begin with as part of Cloud Text-to-Speech, with more coming in the next few months.

There’s a free tier for companies using up to 4 million characters a month with the standard Cloud Text-to-Speech technology, after which there’s a charge of $4 per million characters. The WaveNet version is free up to 1 million characters, then costs $16 for each additional 1 million characters. It costs more because it needs much more processing power. Both versions are billed in fractions of a million characters, so light use can be quite cheap, Aharon said.
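The pricing above can be worked through with a short calculation. This is a sketch under two assumptions not spelled out in the article: that the free allowance is subtracted from each month's usage before billing, and that the remainder is prorated linearly per character. The function name `monthly_cost` and the tier labels are hypothetical, and the rates are the launch figures quoted here, which may no longer be current.

```python
from decimal import Decimal

# Launch pricing as quoted in the article (March 2018).
TIERS = {
    "standard": {"free_chars": 4_000_000, "usd_per_million": Decimal("4")},
    "wavenet": {"free_chars": 1_000_000, "usd_per_million": Decimal("16")},
}


def monthly_cost(voice_type: str, chars: int) -> Decimal:
    """Estimate one month's charge in USD for `chars` characters of text.

    Assumes the free allowance is deducted first and the rest is billed
    pro rata, in fractions of a million characters.
    """
    tier = TIERS[voice_type]
    billable = max(0, chars - tier["free_chars"])
    return tier["usd_per_million"] * billable / Decimal(1_000_000)


if __name__ == "__main__":
    for voice_type, chars in [("standard", 3_000_000),
                              ("standard", 4_500_000),
                              ("wavenet", 1_250_000)]:
        print(f"{voice_type:8} {chars:>9,} chars: ${monthly_cost(voice_type, chars):.2f}")
```

Under these assumptions, 4.5 million characters on the standard voices would cost $2, and 1.25 million characters on WaveNet voices would cost $4.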

Several dozen alpha users have been trying it since November, including Cisco Systems Inc. and Dolphin ONE Communications LLP, which runs the Calll cloud telephony system.

Google isn’t alone in offering text-to-speech services via the cloud. Amazon Web Services Inc., for instance, started offering its Polly text-to-speech service in late 2016. IBM Corp. offers 13 voices in seven languages, driven by its Watson cognitive computing system, in its cloud.

Images: Google
