UPDATED 09:00 EDT / APRIL 20 2022

Amazon releases ‘MASSIVE’ database to scale up natural language understanding

Amazon.com Inc. today announced the release of a massive new dataset, appropriately called “MASSIVE,” which it says can be used to build virtual assistants that support some of the world’s most obscure languages.

Alongside the database, Amazon has released open-source modeling code to help developers build more capable virtual assistants.

The MASSIVE database is what’s known as a “parallel dataset,” meaning that each utterance within it is given in all 51 languages it supports, including many obscure ones that lack the labeled data needed for AI training.
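To make the "parallel" property concrete, here is a minimal Python sketch that checks whether every utterance appears in every language. The record layout (`id`, `locale`, `text`), the locale codes and the toy sentences are assumptions invented for illustration; they are not taken from Amazon's release.

```python
from collections import defaultdict

def missing_translations(records, locales):
    """Map each utterance id to the locales it is missing.

    An empty result means the records form a parallel dataset
    over the given locales. Field names are hypothetical.
    """
    seen = defaultdict(set)
    for rec in records:
        seen[rec["id"]].add(rec["locale"])
    wanted = set(locales)
    return {uid: sorted(wanted - have)
            for uid, have in seen.items() if wanted - have}

# Invented toy records, not real MASSIVE data.
toy = [
    {"id": 0, "locale": "en-US", "text": "wake me up at nine"},
    {"id": 0, "locale": "de-DE", "text": "weck mich um neun"},
    {"id": 1, "locale": "en-US", "text": "play some jazz"},
]

print(missing_translations(toy, ["en-US", "de-DE"]))  # {1: ['de-DE']}
```

In this toy set, utterance 1 has no German version, so the set is not parallel. In a fully parallel dataset the function would return an empty dictionary.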

The idea is that developers can use the MASSIVE database to train AI models to understand those more obscure languages nearly as well as they understand more common languages such as English.

The approach is known as massively multilingual natural language understanding, or NLU, a paradigm that allows AI models to parse and understand inputs from many typologically diverse languages. By learning shared data representations that span multiple languages, AI models can transfer knowledge from languages where training data is abundant to those in which data is scarce, Amazon explained.

Amazon said the MASSIVE database will be particularly useful in advancing spoken-language understanding, or SLU, where audio is converted to text before NLU is performed. Virtual assistants like Amazon Alexa commonly use SLU to understand a user’s commands, but they only support a small fraction of the world’s 7,000-plus languages because of a lack of training data.

It’s hoped that MASSIVE, which more or less stands for Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation, can overcome this scarcity of data. The database contains 1 million realistic, parallel, labeled virtual assistant text utterances that span 51 languages, 18 domains, 60 intents and 55 slots. It was created by professional translators who were tasked with translating or localizing the English-language dataset into 50 typologically diverse languages from 29 genera, including many low-resource languages.

Amazon said the MASSIVE dataset and tools for using it are all available from its GitHub repository starting today. In addition to launching the dataset, it has also created a competition to encourage developers to work with it. The Massively Multilingual NLU 2022 competition is hosted on eval.ai and is composed of two tasks.

The first task, MMNLU-22-Full, invites developers to train and test a single AI model on all 51 languages in the MASSIVE dataset. Having done that, developers can attempt the second task, MMNLU-22-ZeroShot, which involves fine-tuning a pretrained model using only labeled English data and then testing it on all 50 non-English languages in MASSIVE.
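The zero-shot setup can be sketched in a few lines of Python: keep only the English records for fine-tuning and hold out every other language for testing. This is an illustrative sketch, not the competition's actual tooling. The record fields, the `en-US` locale code, the intent labels and the stand-in `predict` function are all assumptions.

```python
from collections import Counter

def zero_shot_split(records, source_locale="en-US"):
    """Fine-tune on one labeled language; evaluate on all the others."""
    train = [r for r in records if r["locale"] == source_locale]
    test = [r for r in records if r["locale"] != source_locale]
    return train, test

def per_locale_accuracy(test_records, predict):
    """Intent accuracy per locale, so weak languages are not averaged away."""
    correct, total = Counter(), Counter()
    for r in test_records:
        total[r["locale"]] += 1
        correct[r["locale"]] += int(predict(r["text"]) == r["intent"])
    return {loc: correct[loc] / total[loc] for loc in total}

# Invented toy records with hypothetical intent labels.
toy = [
    {"locale": "en-US", "text": "set an alarm for six", "intent": "alarm_set"},
    {"locale": "es-ES", "text": "pon una alarma a las seis", "intent": "alarm_set"},
    {"locale": "fr-FR", "text": "joue du jazz", "intent": "play_music"},
]

train, test = zero_shot_split(toy)
# A stand-in "model" that always predicts the same intent.
scores = per_locale_accuracy(test, lambda text: "alarm_set")
print(scores)
```

Reporting accuracy per language rather than as a single average makes it easier to see which languages the English-only fine-tuning fails to reach.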

“This assesses the model’s ability to generalize to new languages, an important consideration given the number of languages around the world for which there is little-to-no labeled data,” Amazon’s AI team wrote in a blog post. “Zero-shot learning is a key technology for scaling NLU technology to many more low-resource languages worldwide.”

Amazon has launched a MASSIVE leaderboard to keep track of participants in the competition, which runs until Aug. 8. Winners will then be invited to give an oral presentation of their work, either in person or virtually, at the EMNLP 2022 conference, which takes place in Abu Dhabi in December.

Image: Freepik

