Almost human: Google offers text-to-speech technology on its cloud
For years, Google has offered the ability to convert text to speech on a number of its services, such as search, Maps and Google Assistant. Now, it’s offering the capability as a service in its cloud.
The company today announced that other companies can now try out Cloud Text-to-Speech in their own services. Google’s newest machine learning service is intended to help them develop better conversational interfaces.
The service is aimed at three main markets, Dan Aharon, product manager for Cloud AI, said in an interview. The main one is voice response systems for call centers, for which Cloud Text-to-Speech can provide real-time, natural-language conversation. “We think this is going to be massively disruptive to the call center space,” he said, a somewhat more polite way of saying all those call center jobs that went to India and the Philippines may themselves soon vanish.
The other two are enabling devices in the “internet of things,” from cars to televisions to robots, to talk back to their users, and converting text such as news articles and books into speech, such as podcasts and audiobooks.
The service has 32 different voices in 12 languages and also allows application developers to customize voice pitch, speaking rate and volume gain. In a demonstration, all this made some snippets of speech from text sound very close to natural. Indeed, according to Google’s own tests, some came quite close to human speech.
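To make the customization options concrete, here is a minimal sketch of a request body that sets pitch, speaking rate and volume gain. The field names (`audioConfig`, `pitch`, `speakingRate`, `volumeGainDb`) and the numeric limits are assumptions drawn from Google’s public REST reference for the `text:synthesize` method, not from this article, and the helper `build_synthesis_request` is hypothetical. The code only builds the JSON and sends nothing over the network.

```python
import json

# Assumed limits from the public REST reference (not from the article).
LIMITS = {
    "pitch": (-20.0, 20.0),         # semitones relative to the default
    "speakingRate": (0.25, 4.0),    # 1.0 is the voice's normal speed
    "volumeGainDb": (-96.0, 16.0),  # decibels relative to the default
}


def build_synthesis_request(text, language_code="en-US", voice_name=None,
                            pitch=0.0, speaking_rate=1.0, volume_gain_db=0.0):
    """Return a JSON-serializable request body; nothing is sent anywhere."""
    audio_config = {
        "audioEncoding": "MP3",
        "pitch": pitch,
        "speakingRate": speaking_rate,
        "volumeGainDb": volume_gain_db,
    }
    # Reject values outside the assumed limits before building the body.
    for field, (low, high) in LIMITS.items():
        if not low <= audio_config[field] <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    voice = {"languageCode": language_code}
    if voice_name is not None:
        voice["name"] = voice_name  # optional: a specific voice identifier
    return {"input": {"text": text}, "voice": voice, "audioConfig": audio_config}


if __name__ == "__main__":
    body = build_synthesis_request("Your call is important to us.",
                                   pitch=-2.0, speaking_rate=0.9)
    print(json.dumps(body, indent=2))
```

A real call would also need authentication and an HTTP client, both left out here on purpose.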
Google is actually using several different text-to-speech technologies: the one it has used for years, as well as two versions from its DeepMind artificial intelligence unit that use WaveNet. Those two create raw audio waveforms from scratch rather than relying on the traditional methods of combining actual voice samples into larger voice fragments or morphing them with transformative algorithms to produce a wider variety of sounds.
The first version of WaveNet, published in late 2016, used a so-called generative model that’s trained with a large sample of real voices and then extracts the underlying structure of the speech, such as what tones follow others. DeepMind said text converted to speech this way produces more accurate results, sometimes topping four on a scale in which human speech is rated about 4.5.
More recently, Google has started using an updated version of WaveNet running on Google’s Cloud Tensor Processing Unit infrastructure. It generates raw waveforms 1,000 times faster than the original model, producing a second of speech in only 50 milliseconds, and offers higher fidelity. Aharon said this version gets 70 percent of the way toward sounding like human speech, though the demos sounded pretty close indeed.
“It’s the closest thing to human speech than we’ve seen before,” he said. Google will offer six WaveNet voices to begin with as part of Cloud Text-to-Speech, with more coming in the next few months.
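The speed figures quoted for the updated WaveNet are easier to appreciate with a little arithmetic. At 50 milliseconds per second of audio, synthesis runs 20 times faster than real time, and a 1,000-fold speedup implies the original model needed roughly 50 seconds for each second of speech. That last figure is an inference from the two numbers quoted, not something Google stated here, and the function names below are hypothetical.

```python
def synthesis_seconds(audio_seconds, ms_per_audio_second=50.0):
    """Compute time in seconds needed to generate `audio_seconds` of speech."""
    return audio_seconds * ms_per_audio_second / 1000.0


def real_time_factor(ms_per_audio_second=50.0):
    """How many times faster than playback speed the synthesis runs."""
    return 1000.0 / ms_per_audio_second


def implied_original_seconds(audio_seconds, speedup=1000.0,
                             ms_per_audio_second=50.0):
    """Inferred compute time for the original model (an assumption)."""
    return audio_seconds * ms_per_audio_second * speedup / 1000.0


if __name__ == "__main__":
    print("One minute of audio takes", synthesis_seconds(60), "seconds")
    print("Real-time factor:", real_time_factor())
    print("Implied original time per audio second:",
          implied_original_seconds(1), "seconds")
```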
The standard Cloud Text-to-Speech technology is free for companies using up to 4 million characters a month, after which there’s a charge of $4 per million characters. The WaveNet version is free up to 1 million characters, then $16 for each additional 1 million characters. WaveNet costs more because it needs much more processing power. But both versions are billed in fractions of 1 million characters, so light use can be pretty cheap, Aharon said.
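The pricing above can be sketched as a small calculator. The free allowances and per-million rates come straight from the figures quoted in this article; the function name `monthly_cost` and the tier labels are hypothetical, and the proportional billing of partial millions follows Aharon’s description rather than a published rate card.

```python
from decimal import Decimal

# Monthly free allowance and price per million characters, as quoted above.
TIERS = {
    "standard": {"free_chars": 4_000_000, "usd_per_million": Decimal("4")},
    "wavenet": {"free_chars": 1_000_000, "usd_per_million": Decimal("16")},
}


def monthly_cost(chars, tier):
    """Estimate the monthly charge in USD for `chars` characters on `tier`."""
    if chars < 0:
        raise ValueError("chars must be non-negative")
    plan = TIERS[tier]
    # Only characters beyond the free allowance are billable.
    billable = max(0, chars - plan["free_chars"])
    # Fractions of a million are charged proportionally.
    return plan["usd_per_million"] * billable / Decimal(1_000_000)


if __name__ == "__main__":
    for tier in TIERS:
        print(tier, monthly_cost(4_500_000, tier))
```

For example, 4.5 million characters in a month would cost $2 on the standard voices (500,000 billable characters) and $56 on WaveNet (3.5 million billable characters).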
Several dozen alpha users have been trying it since November, including Cisco Systems Inc. and Dolphin ONE Communications LLP, which runs the Calll cloud telephony system.
Google isn’t alone in offering text-to-speech services via the cloud. Amazon Web Services Inc., for instance, started offering its Polly text-to-speech service in late 2016. IBM Corp. offers 13 voices in seven languages, driven by its Watson cognitive computing system, in its cloud.