

Microsoft Corp.’s artificial intelligence research team said it has made a significant breakthrough with its natural-language speech recognition technology.
The company announced on Monday that it has achieved a human-parity word error rate of 5.1 percent on a conversational speech recognition task known as Switchboard, a collection of recorded phone conversations used as a benchmark for the technology. In other words, Microsoft’s speech recognition systems can now transcribe spoken words as accurately as human transcribers can.
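Word error rate, the metric behind the 5.1 percent figure, counts the word-level substitutions, insertions and deletions needed to turn a system’s transcript into the reference transcript, divided by the number of reference words. This sketch is purely illustrative and is not Microsoft’s evaluation code; it computes WER with a standard edit-distance dynamic program:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub_cost)  # match/substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One missed word out of six gives a WER of about 16.7 percent.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

On this scale, 5.1 percent means roughly one word in twenty is transcribed incorrectly, which matched the error rate of professional human transcribers measured on the same recordings.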
It also means Microsoft has beaten its own previous record of a 5.9 percent word error rate on Switchboard, set last year.
Microsoft fellow Xuedong Huang said the achievement was made possible by advances in AI that have helped the company push the boundaries of what its speech recognition software can accomplish.
“We reduced our error rate by about 12 percent compared to last year’s accuracy level, using a series of improvements to our neural net-based acoustic and language models,” Huang said in the announcement. “We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long short-term memory) model for improved acoustic modeling.”
Microsoft also relied on its Cognitive Toolkit, or CNTK, the company’s open-source deep learning framework. In addition, the research team said it benefited from the performance boost of GPU-accelerated Azure virtual machines, which use graphics processing units built by Nvidia Corp. to speed up AI workloads.
Microsoft has published a detailed technical report on the achievement for those who want a deeper understanding of how it works.
Despite the new milestone, Microsoft said it still has a long way to go in creating speech recognition technology that’s equal to humans. Huang noted that challenges remain in attaining human levels of speech recognition in noisy environments. In addition, he said that some languages and accents can still cause problems for computer systems, as there is limited training data available. There’s also a long way to go before computer systems can actually understand speech, instead of just recognizing what’s said.
“Moreover, we have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent,” Huang said. “Moving from recognizing to understanding speech is the next major frontier for speech technology.”
Microsoft’s main rival in the natural speech recognition space seems to be Google Inc., which introduced a service called the Cloud Speech API last April. That service, which powers the voice capabilities of Google Search, Google Assistant and Google Now, was recently updated with faster processing and support for long-form audio sessions.
Google Product Manager Dan Aharon said enterprises were already using Cloud Speech API for a range of applications.
“Among early adopters of Cloud Speech API, we have seen two main use cases emerge: speech as a control method for applications and devices like voice search, voice commands and Interactive Voice Response; and also in speech analytics,” Aharon said.