UPDATED 11:00 EST / MARCH 27 2018

CLOUD

Almost human: Google offers text-to-speech technology on its cloud

For years, Google has offered the ability to convert text to speech on a number of its services, such as search, Maps and Google Assistant. Now, it’s offering the capability as a service in its cloud.

The company today announced that other companies can now try out Cloud Text-to-Speech in their own services. Google’s newest machine learning service is intended to help them develop better conversational interfaces.

The service is aimed at three main markets, Dan Aharon, product manager for Cloud AI, said in an interview. The main one is voice response systems for call centers, for which Cloud Text-to-Speech can provide real-time, natural-language conversation. “We think this is going to be massively disruptive to the call center space,” he said, a somewhat more polite way of saying all those call center jobs that went to India and the Philippines may soon vanish themselves.

The other two are enabling devices in the “internet of things,” from cars to televisions to robots, to talk back to their users, and converting text such as news articles and books into speech, such as podcasts and audiobooks.

The service has 32 different voices in 12 languages and also allows application developers to customize voice pitch, speaking rate and volume gain. In a demonstration, all this made some snippets of speech from text sound very close to natural. Indeed, according to Google’s own tests, some came quite close to human speech.
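To make those customization knobs concrete, here is a minimal sketch of what a synthesis request body might look like. It only builds and prints JSON and does not call Google. The field names (`input`, `voice`, `audioConfig`, `pitch`, `speakingRate`, `volumeGainDb`) reflect our understanding of the public REST request format and are not taken from Google's announcement; the helper name `build_synthesize_request` and the voice name in the usage example are illustrative assumptions.

```python
import json


def build_synthesize_request(text, language_code="en-US", voice_name=None,
                             pitch=0.0, speaking_rate=1.0, volume_gain_db=0.0):
    """Build a JSON-serializable request body (hypothetical helper).

    Field names are assumptions based on the public REST format,
    not something stated in the article.
    """
    voice = {"languageCode": language_code}
    if voice_name is not None:
        voice["name"] = voice_name  # a specific voice, if the caller wants one
    return {
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {
            "audioEncoding": "MP3",
            "pitch": pitch,                  # shift up or down from the default
            "speakingRate": speaking_rate,   # 1.0 is normal speed
            "volumeGainDb": volume_gain_db,  # 0.0 leaves loudness unchanged
        },
    }


if __name__ == "__main__":
    body = build_synthesize_request(
        "Your call is important to us.",
        voice_name="en-US-Wavenet-A",  # assumed voice name, for illustration
        pitch=-2.0,
        speaking_rate=0.9,
    )
    print(json.dumps(body, indent=2))
```

A developer would send a body like this to the service with valid credentials and receive encoded audio back; that step is left out here because it depends on account setup.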

Google is actually using several different text-to-speech technologies: the one it has used for years, as well as two versions from its DeepMind artificial intelligence unit that use WaveNet. Those two create raw audio waveforms from scratch rather than relying on the traditional methods of combining actual voice samples into larger voice fragments or morphing them with transformative algorithms to make a wider variety of sounds.

[Chart: Google Cloud Text-to-Speech quality ratings]

The first version of WaveNet, published in late 2016, used a so-called generative model that’s trained with a large sample of real voices and then extracts the underlying structure of the speech, such as what tones follow others. DeepMind said text converted to speech this way produces more accurate results, sometimes topping four on a scale in which human speech is rated about 4.5 (above).

More recently, Google has started using an updated version of WaveNet (pictured, top) running on Google’s Cloud Tensor Processing Unit infrastructure. It generates raw waveforms 1,000 times faster than the original model, producing a second of speech in only 50 milliseconds, and offers higher fidelity. Aharon said this version gets 70 percent of the way toward sounding like human speech, though the demos sounded pretty close indeed.
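Those speed figures imply a couple of numbers the announcement doesn't spell out. The back-of-the-envelope sketch below takes the two stated figures at face value (50 milliseconds per second of speech, and a 1,000-fold speedup) and derives the rest; the function names are our own.

```python
NEW_MS_PER_AUDIO_SECOND = 50   # stated: one second of speech in 50 ms
SPEEDUP_OVER_ORIGINAL = 1000   # stated: 1,000 times faster than the original


def real_time_factor(ms_per_audio_second):
    """How many seconds of audio are produced per second of compute."""
    return 1000 / ms_per_audio_second


def original_ms_per_audio_second(new_ms, speedup):
    """Implied generation time of the original model, in milliseconds."""
    return new_ms * speedup


if __name__ == "__main__":
    print(real_time_factor(NEW_MS_PER_AUDIO_SECOND))  # 20.0
    print(original_ms_per_audio_second(
        NEW_MS_PER_AUDIO_SECOND, SPEEDUP_OVER_ORIGINAL))  # 50000
```

In other words, the updated model runs about 20 times faster than real time, while the original would have needed roughly 50 seconds to produce a single second of speech, assuming the quoted figures are directly comparable.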

“It’s the closest thing to human speech than we’ve seen before,” he said. Google will offer six WaveNet voices to begin with as part of Cloud Text-to-Speech, with more coming in the next few months.

There’s a free tier for companies using up to 4 million characters a month with the standard Cloud Text-to-Speech technology, after which there’s a charge of $4 per million characters. The WaveNet version is free up to 1 million characters, then $16 for each additional 1 million characters. The latter costs more because much more processing power is needed. But both versions are billed in fractions of 1 million characters, so light use can be pretty cheap, Aharon said.
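For a rough sense of what that means on a monthly bill, here is a small sketch using only the prices quoted above. It assumes the free allowance is subtracted first and the remainder is billed pro rata; the function and constant names are ours, and actual billing may round differently.

```python
def monthly_cost(chars, free_chars, usd_per_million):
    """Estimated monthly charge in dollars under the quoted pricing."""
    billable = max(0, chars - free_chars)        # nothing owed inside the free tier
    return billable / 1_000_000 * usd_per_million  # pro rata per million characters


# Prices as quoted in the article.
STANDARD = {"free_chars": 4_000_000, "usd_per_million": 4.0}
WAVENET = {"free_chars": 1_000_000, "usd_per_million": 16.0}

if __name__ == "__main__":
    print(monthly_cost(3_000_000, **STANDARD))  # 0.0, still inside the free tier
    print(monthly_cost(4_500_000, **STANDARD))  # 2.0
    print(monthly_cost(1_500_000, **WAVENET))   # 8.0
```

So a company that runs 500,000 characters over its allowance would owe about $2 on standard voices or $8 on WaveNet voices.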

Several dozen alpha users have been trying it since November, including Cisco Systems Inc. and Dolphin ONE Communications LLP, which runs the Calll cloud telephony system.

Google isn’t alone in offering text-to-speech services via the cloud. Amazon Web Services Inc., for instance, started offering its Polly text-to-speech service in late 2016. IBM Corp. offers 13 voices in seven languages, driven by its Watson cognitive computing system, in its cloud.

Images: Google
