

The MLCommons Association, a nonprofit consortium that aims to improve machine learning for the public good, today announced the release of two key new datasets that it says organizations can leverage to develop superior artificial intelligence models.
The consortium said the People’s Speech Dataset is one of the world’s most comprehensive collections of English-language speech that’s licensed for academic and commercial use. Meanwhile, the Multilingual Spoken Words Corpus is said to be one of the largest audio speech datasets in the world, with keywords spoken in 50 languages.
What MLCommons is trying to do is level the playing field in AI development. It notes that smaller organizations have a distinct disadvantage when trying to develop models for speech recognition, because the most comprehensive datasets available have always had high licensing costs. Added to that, tech giants such as Google LLC and Apple Inc. can gather large amounts of free training data through devices such as smartphones.
MLCommons points out that when researchers from the Mozilla Foundation began developing their DeepSpeech English-language speech recognition tool, they were forced to reach out to TV and radio stations to acquire enough public speech data to train it.
The People’s Speech Dataset is meant to remedy that problem. It provides more than 30,000 hours of supervised conversational audio released under a Creative Commons license, meaning it can be used to create voice recognition models that power voice assistants and transcription software.
As for the MSWC dataset, it has more than 340,000 keywords with upwards of 23.4 million examples spanning languages spoken by more than 5 billion people. MLCommons said it can be used to train machine learning models for applications such as call centers and smart devices.
Constellation Research Inc. analyst Holger Mueller said MLCommons’ datasets will be welcomed by a developer community that struggles to obtain the high-quality training data it needs to build effective AI models in speech recognition. Speech data, he said, is very hard to capture due to matters around privacy and consent.
“A standardized dataset also opens things up for performance benchmarks as well, so we will see what these two datasets can do to improve the quality of AI models,” Mueller said. “Nothing improves AI quality more than competitions based on standardized datasets.”
Both of the datasets come with permissive licensing terms, including commercial fair use, which is not allowed with many other speech training libraries.
Keith Achorn, a machine learning engineer at Intel Corp. who helped oversee the curation of the datasets, said the hope is that they will help more developers build speech recognition systems without budgetary constraints.