UPDATED 18:46 EDT / DECEMBER 14 2021

MLCommons releases open-source datasets for training speech recognition models

The MLCommons Association, a nonprofit consortium that aims to improve machine learning for the public good, today announced the release of two new datasets that it says can be leveraged by organizations to develop superior artificial intelligence models.

The consortium said the People’s Speech Dataset is one of the world’s most comprehensive collections of English-language speech that’s licensed for academic and commercial use. Meanwhile, the Multilingual Spoken Words Corpus is said to be one of the largest audio speech datasets in the world, with keywords spoken in 50 languages.

What MLCommons is trying to do is level the playing field in AI development. It notes that smaller organizations have a distinct disadvantage when trying to develop models for speech recognition, because the most comprehensive datasets available have always had high licensing costs. Added to that, tech giants such as Google LLC and Apple Inc. can gather large amounts of free training data through devices such as smartphones.

MLCommons points out that when the Mozilla Foundation began developing its DeepSpeech English-language speech recognition tool, it was forced to reach out to TV and radio stations to acquire enough public speech data to train it.

The People’s Speech Dataset is meant to remedy that problem. It provides more than 30,000 hours of supervised conversational audio released under a Creative Commons license, meaning it can be used to create voice recognition models that power voice assistants and transcription software.

As for the MSWC dataset, it has more than 340,000 keywords with upwards of 23.4 million examples spanning languages spoken by more than 5 billion people. MLCommons said it can be used to train machine learning models for applications such as call centers and smart devices.
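To put those MSWC figures in perspective, a quick back-of-envelope calculation (taking the cited round numbers of 340,000 keywords and 23.4 million examples at face value) shows the corpus averages roughly 70 recorded examples per keyword:

```python
# Rough sense of scale for the MSWC figures cited above.
# Assumption: the article's round numbers are used as-is.
keywords = 340_000
examples = 23_400_000

avg_examples_per_keyword = examples / keywords
print(f"~{avg_examples_per_keyword:.1f} examples per keyword")  # ~68.8 examples per keyword
```

That density of examples per keyword is what makes the corpus usable for training keyword-spotting models rather than just benchmarking them.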

Constellation Research Inc. analyst Holger Mueller said MLCommons’ datasets will be welcomed by a developer community that struggles to obtain the high-quality training data it needs to build effective speech recognition models. Speech data, he said, is very hard to capture because of privacy and consent concerns.

“A standardized dataset also opens things up for performance benchmarks as well, so we will see what these two datasets can do to improve the quality of AI models,” Mueller said. “Nothing improves AI quality more than competitions based on standardized datasets.”

Both of the datasets come with permissive licensing terms, including commercial fair use, which is not allowed with many other speech training libraries.

Keith Achorn, a machine learning engineer at Intel Corp. who helped oversee the curation of the datasets, said the hope is that they will help more developers build speech recognition systems regardless of budget.
