UPDATED 11:30 EDT / JUNE 04 2021

Facebook open-sources Flores-101 dataset to enable more accurate AI translations

Facebook Inc. today open-sourced a dataset called Flores-101 for use in the development of artificial intelligence models that translate text between different languages.

Building an AI model involves training a neural network on a large amount of information until it learns to identify useful patterns. Afterwards, developers check whether the AI generates sufficiently accurate results to be used in production by having it process a test dataset. Flores-101 is one such test dataset: It is designed for evaluating translation models and contains sentences translated across 101 languages.

The Facebook researchers who worked on Flores-101 say it addresses a major gap in the AI ecosystem. Measuring AI accuracy is an essential part of machine learning projects because without the ability to evaluate processing results reliably, developers can’t determine if a tweak to a model increased or decreased its performance.
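To make the idea concrete, here is a minimal Python sketch of how a held-out test set lets a developer tell whether a tweak helped. It is an illustration only, not Facebook's evaluation code: the two-sentence test set, the `BASELINE` and `TWEAKED` lookup tables standing in for two versions of a model, and the simple word-overlap score are all assumptions made for this example. Real evaluations use established translation metrics and far more test sentences.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Word-overlap F1 between a model translation and a reference translation."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Count words that appear in both, respecting how often each occurs.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(translate, test_set):
    """Average score of a translation function over (source, reference) pairs."""
    scores = [unigram_f1(translate(src), ref) for src, ref in test_set]
    return sum(scores) / len(scores)


# Hypothetical test set (Spanish to English), invented for illustration.
TEST_SET = [
    ("el gato duerme", "the cat sleeps"),
    ("la casa es grande", "the house is big"),
]

# Hypothetical stand-ins for two versions of a translation model.
BASELINE = {"el gato duerme": "the cat sleep", "la casa es grande": "house big"}
TWEAKED = {"el gato duerme": "the cat sleeps", "la casa es grande": "the house is large"}

baseline_score = evaluate(BASELINE.get, TEST_SET)
tweaked_score = evaluate(TWEAKED.get, TEST_SET)
print(f"baseline: {baseline_score:.3f}  tweaked: {tweaked_score:.3f}")
```

Because both versions are scored against the same reference translations, the two averages are directly comparable. Without reference translations for a given language pair, there is nothing to compute either score against, which is the gap the researchers describe.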

However, the test datasets commonly used to perform evaluations for the most part cover only a limited number of widely spoken languages such as English and Spanish. As a result, developers building AI software for translating between other languages often face challenges in assessing their models’ accuracy. 

“Imagine trying to bake a cake but not being able to taste it,” the Flores-101 team explained in a blog post. “It’s near impossible to know whether it’s any good, and even harder to know how to improve the recipe for future attempts.”

Flores-101 consists of text blocks extracted from news articles, travel guides and other sources that have been translated across 101 languages. For more than 80% of those languages, there was previously only a limited number of AI training datasets available or none at all, Facebook’s researchers said.

Over recent years, computer scientists have sought to make AI translation models more accurate by configuring them to analyze words and sentences in the context of the surrounding text. According to Facebook, Flores-101 can support projects that take this approach. “FLORES is constructed to translate multiple adjacent sentences from selected documents, meaning models can measure whether document level context improves translation quality,” the company’s researchers wrote.

For good measure, the social network also included metadata alongside the sentences, such as tags describing the topic of each text block. Such information can help machine learning models infer the meaning of sentences more easily, which in turn improves the quality of translations.

Facebook assembled the text that makes up Flores-101 through a multistage process. First, the company asked a team of professional translators to translate each piece of text into the supported languages. Then, an editor reviewed each document for errors before handing it over to yet another team of translators, who finalized the dataset. 

“Good benchmarks are difficult to construct,” Facebook’s researchers said. “They need to be able to accurately reflect meaningful differences between models so they can be used by researchers to make decisions. Translation benchmarks can be particularly difficult because the same quality standard must be met across all languages, not just a select few for which translators are more readily available.”

“Efforts like FLORES are of immense value, because they not only draw attention to under served languages, but they immediately invite and actively facilitate research on all these languages,” commented Antonios Anastasopoulos, an assistant professor at George Mason University’s Department of Computer Science.

To encourage the development of AI translation models that support languages for which there are currently limited training datasets available, Facebook has launched a collaboration with Microsoft Corp. and the Workshop on Machine Translation. As part of the initiative, Facebook is sponsoring grants that will enable researchers to use graphics processing units in Microsoft’s Azure cloud platform for their projects. The social network says the grants will provide “thousands of GPU hours” at no charge.

Photo: Eston Bond/Flickr
