UPDATED 11:00 EDT / OCTOBER 19 2020

Facebook open-sources its M2M-100 multilingual model to improve translation accuracy

Facebook Inc. said today it has made substantial progress in using machine learning to generate more accurate translations between any two languages without relying on English-language data.

The company is open-sourcing its latest creation, M2M-100, which it says is the first multilingual machine translation model that can translate directly between any pair of 100 languages.

Until now, most multilingual machine translation models have relied on English as a kind of intermediary when translating between two other languages, because of the wide availability of English training data. So, for example, when such a model translates a sentence from French to Chinese, it first translates the French into English, and from there into Chinese. These models work well enough in most cases but are often inaccurate when it comes to more complex sentences and phrases.

Facebook said M2M-100 can better preserve meaning by translating directly from Chinese to French, or between any other pair of its 100 supported languages, without using English as an intermediary.
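A toy sketch can show why pivoting through English may lose meaning. The lookup tables below are hypothetical stand-ins for translation models, not M2M-100 or any real system, and they treat the French "vous" only as a singular formal pronoun. English "you" does not mark the formal/informal distinction that both French and Chinese make, so the pivot route has nothing left to pass on.

```python
# Toy lookup tables standing in for translation models (illustrative only).
FR_TO_EN = {"tu": "you", "vous": "you"}   # English collapses both forms
EN_TO_ZH = {"you": "你"}                   # the pivot can only pick one form
FR_TO_ZH = {"tu": "你", "vous": "您"}       # direct mapping keeps formality


def translate_via_english(word: str) -> str:
    """Pivot translation: French -> English -> Chinese."""
    return EN_TO_ZH[FR_TO_EN[word]]


def translate_direct(word: str) -> str:
    """Direct translation: French -> Chinese, with no intermediate language."""
    return FR_TO_ZH[word]


for word in ("tu", "vous"):
    print(word, translate_via_english(word), translate_direct(word))
```

In this toy setup the pivot route returns the informal form for both inputs, while the direct route keeps the formal form for "vous".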

Translating between so many different language pairs is not an easy task, since the models need access to lots of high-quality training data. In a blog post, Facebook AI researcher Angela Fan explained how she and her team set about creating a massive “many-to-many” dataset containing more than 7.5 billion sentences in 100 different languages. This data was gathered using open-source data mining tools such as ccAligned, ccMatrix and LASER, and then split into 14 distinct language groups based on parameters such as linguistic classification, geography and cultural similarities.

“People living in countries with languages of the same family tend to communicate more often and would benefit from high quality translations,” Fan said. “For instance, one group would include languages spoken in India, like Bengali, Hindi, Marathi, Nepali, Tamil or Urdu. We systematically mined all possible language pairs within each group.”

Within each of those 14 language groups, Facebook then identified one to three “bridge languages” to serve as the basis of its translations into other language groups.

“We then mined parallel training data for all possible combinations of these bridge languages,” Fan said. “Using this technique, our training dataset ended up with 7.5 billion parallel sentences of data, corresponding to 2,200 directions. Since the mined data can be used to train two directions of a given language pair (e.g. en->fr and fr->en), our mining strategy helps us effectively sparsely mine to best cover all 100×100 (a total of 9,900) directions in one model.”
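The mining strategy Fan describes can be sketched roughly as follows. The three groups and their bridge languages here are made up for illustration (the real dataset uses 14 groups and 100 languages, and Facebook's actual assignments are not given in this article). The sketch also shows where the 9,900 figure comes from: 100 source languages times 99 possible targets each.

```python
from itertools import combinations

# Hypothetical miniature setup: 3 groups instead of 14, a few languages each.
GROUPS = {
    "indic": ["bn", "hi", "mr", "ta"],
    "romance": ["fr", "es", "it"],
    "germanic": ["en", "de", "nl"],
}
# Hypothetical bridge languages, one or two per group.
BRIDGES = {"indic": ["hi", "ta"], "romance": ["fr"], "germanic": ["en", "de"]}


def mined_pairs(groups, bridges):
    """Return the unordered language pairs selected for mining."""
    pairs = set()
    # 1. Every pair of languages inside the same group.
    for langs in groups.values():
        pairs.update(frozenset(p) for p in combinations(langs, 2))
    # 2. Every pair of bridge languages, across all groups.
    all_bridges = [lang for langs in bridges.values() for lang in langs]
    pairs.update(frozenset(p) for p in combinations(all_bridges, 2))
    return pairs


def total_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs where source != target."""
    return num_languages * (num_languages - 1)


pairs = mined_pairs(GROUPS, BRIDGES)
num_langs = sum(len(langs) for langs in GROUPS.values())
# Each mined pair can train two directions (e.g. en->fr and fr->en).
print(f"mined pairs: {len(pairs)}, directions trained: {2 * len(pairs)}")
print(f"all directions for {num_langs} languages: {total_directions(num_langs)}")
print(f"all directions for 100 languages: {total_directions(100)}")
```

The point of the design is sparsity: only a subset of pairs is mined directly, yet every language stays connected to every other through its group and the bridge languages.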

Fan’s team also used a technique known as “back translation” to create synthetic data to supplement the already mined parallel data.
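In general terms, back translation takes real sentences in the target language, runs them through a reverse translation model, and pairs the machine-made source sentences with the original human-written targets. The sketch below is a minimal illustration of that idea, not Facebook's pipeline: `back_translate`, `toy_reverse_model` and the tiny dictionary are hypothetical names and data invented for the example.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    reverse_model: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from monolingual target text.

    Each real target-language sentence is translated back into the source
    language; the machine-made source is paired with the human-written target.
    """
    return [(reverse_model(sentence), sentence) for sentence in target_sentences]


# Hypothetical stand-in for a trained French -> English model.
TOY_FR_TO_EN = {
    "bonjour le monde": "hello world",
    "merci beaucoup": "thank you very much",
}


def toy_reverse_model(sentence: str) -> str:
    return TOY_FR_TO_EN[sentence]


mined = [("good morning", "bonjour")]       # an already mined en -> fr pair
synthetic = back_translate(list(TOY_FR_TO_EN), toy_reverse_model)
training_data = mined + synthetic           # synthetic data supplements mined data
for source, target in training_data:
    print(f"{source!r} -> {target!r}")
```

The target side of every synthetic pair is genuine text, which is why the technique is useful for directions where little parallel data can be mined.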

“Overall, the combination of our bridge strategy and back translated data improved performance on the 100 back translated directions by 1.7 BLEU on average compared to training on mined data alone,” Fan said. “With a more robust, efficient, and high-quality training set, we were well-equipped with a strong foundation for building and scaling our many-to-many model.”

Fan said the finished M2M-100 model translates with much greater accuracy than the English-centric multilingual models that Facebook currently uses, outperforming those systems by 10 points on BLEU, a widely used metric for evaluating machine translations. Facebook ultimately wants to replace those models with M2M-100 in order to improve the quality of translations for its millions of users who speak low-resource languages.
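For readers unfamiliar with BLEU, the metric scores a translation by how many of its word sequences (n-grams) also appear in a reference translation, with a penalty for output that is too short. The sketch below is a simplified, single-reference, sentence-level version without smoothing, scaled to 0-100. It is meant only to convey the idea; published results such as Facebook's are normally computed at corpus level with standard tooling, and `simple_bleu` is a name made up for this example.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU on a 0-100 scale for one candidate and one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))
print(simple_bleu("the cat sat on a mat", reference))
```

A perfect match scores 100, and a single wrong word in a six-word sentence already pulls the score down sharply, which gives some sense of scale for a 10-point difference.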

“We’ll continue to improve our model by incorporating such cutting-edge research, exploring ways to deploy MT systems responsibly, and creating more specialized computation architectures necessary to bring this to production,” Fan said.

Image: Facebook

