UPDATED 11:30 EDT / JUNE 04 2021


Facebook open-sources Flores-101 dataset to enable more accurate AI translations

Facebook Inc. today open-sourced a dataset called Flores-101 for use in the development of artificial intelligence models that translate text between different languages.

Building an AI model involves training a neural network on a large amount of information until it learns to identify useful patterns. Afterward, developers check whether the AI generates sufficiently accurate results to be used in production by having it process a test dataset. Flores-101 is a test dataset for evaluating translation models that contains sentences translated across 101 languages.

The Facebook researchers who worked on Flores-101 say it addresses a major gap in the AI ecosystem. Measuring AI accuracy is an essential part of machine learning projects because without the ability to evaluate processing results reliably, developers can’t determine if a tweak to a model increased or decreased its performance.
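To make the evaluation idea concrete, here is a minimal sketch of how a held-out test set can show whether a tweak helped or hurt a translation model. It is illustrative only: the sentences, the `model_before` and `model_after` outputs and the simple word-overlap score are assumptions made for this example, not part of Flores-101 or the metric Facebook uses.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Word-overlap F1 between a model translation and a human reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Count words shared by both sentences, respecting repeats.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def corpus_score(hypotheses, references):
    """Average the sentence-level score over the whole test set."""
    if not references or len(hypotheses) != len(references):
        raise ValueError("need one hypothesis per reference, and at least one pair")
    pairs = zip(hypotheses, references)
    return sum(unigram_f1(h, r) for h, r in pairs) / len(references)


# Hypothetical test set: human reference translations.
references = ["the cat sat on the mat", "she reads a book"]

# Hypothetical outputs from the same model before and after a tweak.
model_before = ["a cat is on mat", "she read book"]
model_after = ["the cat sat on a mat", "she reads a book"]

score_before = corpus_score(model_before, references)
score_after = corpus_score(model_after, references)
improved = score_after > score_before
print(f"before={score_before:.3f} after={score_after:.3f} improved={improved}")
```

Without reference translations for a given language pair, neither score can be computed, which is the gap the researchers describe.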

However, the test datasets commonly used for evaluations mostly cover only a limited number of widely spoken languages such as English and Spanish. As a result, developers building AI software for translating between other languages often face challenges in assessing their models’ accuracy.

“Imagine trying to bake a cake but not being able to taste it,” the Flores-101 team explained in a blog post. “It’s near impossible to know whether it’s any good, and even harder to know how to improve the recipe for future attempts.”

Flores-101 consists of text blocks extracted from news articles, travel guides and other sources that have been translated across 101 languages. For more than 80% of those languages, there was previously only a limited number of AI training datasets available or none at all, Facebook’s researchers said.

In recent years, computer scientists have sought to make AI translation models more accurate by configuring them to analyze words and sentences in the context of the surrounding text. According to Facebook, Flores-101 can support projects that take this approach. “FLORES is constructed to translate multiple adjacent sentences from selected documents, meaning models can measure whether document level context improves translation quality,” the company’s researchers wrote.
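As a rough illustration of what document-level context means in practice, the sketch below pairs each sentence with the sentences that precede it in the same document. The record layout (`doc_id`, `text`), the sample sentences and the `with_context` helper are hypothetical and do not reflect the actual Flores-101 file format.

```python
from itertools import groupby

# Hypothetical records: adjacent sentences from the same document share a doc_id.
records = [
    {"doc_id": "news-1", "text": "The storm reached the coast on Monday."},
    {"doc_id": "news-1", "text": "It weakened as it moved inland."},
    {"doc_id": "guide-7", "text": "The old town is best explored on foot."},
]


def with_context(records, window=1):
    """Pair each sentence with up to `window` preceding sentences of its document."""
    examples = []
    # groupby only merges neighbouring records, so sentences must be stored in
    # document order, as adjacent sentences would be.
    for doc_id, group in groupby(records, key=lambda r: r["doc_id"]):
        sentences = [r["text"] for r in group]
        for i, sentence in enumerate(sentences):
            context = sentences[max(0, i - window):i]
            examples.append({"doc_id": doc_id, "context": context, "source": sentence})
    return examples


examples = with_context(records)
for example in examples:
    print(example)
```

A model can then be scored twice, once with the `context` field supplied and once without, to check whether the surrounding text improves its translations.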

For good measure, the social network also included metadata alongside the sentences, such as tags explaining the topic of each text block. Such information can help machine learning models infer the meaning of sentences more easily, which in turn improves the quality of translations.

Facebook assembled the text that makes up Flores-101 through a multistage process. First, the company asked a team of professional translators to translate each piece of text into the supported languages. Then, an editor reviewed each document for errors before handing it over to yet another team of translators, who finalized the dataset. 

“Good benchmarks are difficult to construct,” Facebook’s researchers said. “They need to be able to accurately reflect meaningful differences between models so they can be used by researchers to make decisions. Translation benchmarks can be particularly difficult because the same quality standard must be met across all languages, not just a select few for which translators are more readily available.”

“Efforts like FLORES are of immense value, because they not only draw attention to under served languages, but they immediately invite and actively facilitate research on all these languages,” commented Antonios Anastasopoulos, an assistant professor at George Mason University’s Department of Computer Science.

To encourage the development of AI translation models that support languages for which there are currently limited training datasets available, Facebook has launched a collaboration with Microsoft Corp. and the Workshop on Machine Translation. As part of the initiative, Facebook is sponsoring grants that will enable researchers to use graphics processing units in Microsoft’s Azure cloud platform for their projects. The social network says the grants will provide “thousands of GPU hours” at no charge.

Photo: Eston Bond/Flickr
