

The MLCommons Association, a nonprofit consortium that aims to improve machine learning for the public good, today announced the release of two key new datasets that it says organizations can leverage to develop superior artificial intelligence models.
The consortium said the People’s Speech Dataset is one of the world’s most comprehensive collections of English-language speech that’s licensed for academic and commercial use. Meanwhile, the Multilingual Spoken Words Corpus is said to be one of the largest audio speech datasets in the world, with keywords spoken in 50 languages.
What MLCommons is trying to do is level the playing field in AI development. It notes that smaller organizations have a distinct disadvantage when trying to develop models for speech recognition, because the most comprehensive datasets available have always had high licensing costs. Added to that, tech giants such as Google LLC and Apple Inc. can gather large amounts of free training data through devices such as smartphones.
MLCommons points out that when researchers from the Mozilla Foundation began developing their DeepSpeech English-language speech recognition tool, they were forced to reach out to TV and radio stations to acquire enough public speech data to train it.
The People’s Speech Dataset is meant to remedy that problem. It provides more than 30,000 hours of supervised conversational audio released under a Creative Commons license, meaning it can be used to create voice recognition models that power voice assistants and transcription software.
As for the MSWC dataset, it has more than 340,000 keywords with upwards of 23.4 million examples spanning languages spoken by more than 5 billion people. MLCommons said it can be used to train machine learning models for applications such as call centers and smart devices.
Constellation Research Inc. analyst Holger Mueller said MLCommons’ datasets will be welcomed by a developer community that struggles to obtain the high-quality training data it needs to build effective AI models in speech recognition. Speech data, he said, is very hard to capture due to matters around privacy and consent.
“A standardized dataset also opens things up for performance benchmarks as well, so we will see what these two datasets can do to improve the quality of AI models,” Mueller said. “Nothing improves AI quality more than competitions based on standardized datasets.”
Both of the datasets come with permissive licensing terms, including commercial fair use, which is not allowed with many other speech training libraries.
Keith Achorn, a machine learning engineer at Intel Corp. who helped oversee the curation of the datasets, said the hope is that they will help more developers build speech recognition systems without budgetary constraints.