UPDATED 12:00 EDT / JANUARY 20 2022

Meta AI creates algorithm that can learn from speech, text and vision

Meta Platforms Inc.’s artificial intelligence researchers have come up with what they say is the world’s first high-performance self-supervised algorithm that can train AI models across multiple modalities, whether speech, vision or text.

The algorithm is called data2vec. In a blog post today, Meta said it will enable the creation of smarter AI that can learn more generally and perform multiple tasks, including those that are unfamiliar.

Meta is trying to solve one of the big constraints of self-supervised learning, which enables machines to learn by directly observing their environment rather than being explicitly taught via labeled images, text or audio. Although that is a big improvement over supervised training, self-supervised learning remains difficult to scale because algorithms consume images, speech and other modalities in very different ways.

For example, an algorithm used to read text is trained to fill in blanks that have been masked out of sentences. A speech model, however, needs to learn an inventory of basic sounds in order to predict missing sounds in a person’s speech. Computer vision models, meanwhile, are usually trained to assign similar representations to a color image, of a cow perhaps, and the same image flipped upside down, so the model associates the two more closely than it would an unrelated image of, say, a dolphin.

AI algorithms also predict different units for each modality. Image recognition involves predicting pixels or visual tokens, while text involves words and speech requires models to predict sounds from a learned inventory.
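
To make the contrast concrete, here is a minimal sketch, in PyTorch, of the “fill in the blanks” text objective described above. The model, names and sizes (TinyMaskedLM, the vocabulary, the dimensions) are illustrative assumptions, not Meta’s code; the point is that the training target is a vocabulary of discrete tokens, something that exists only for text.

```python
# Minimal sketch of a masked-text objective (illustrative, not Meta's code).
import torch
import torch.nn as nn

VOCAB, DIM, MASK_ID = 1000, 64, 0

class TinyMaskedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)   # predicts discrete token ids

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

model = TinyMaskedLM()
tokens = torch.randint(1, VOCAB, (8, 16))   # a batch of token sequences
mask = torch.rand(tokens.shape) < 0.15      # hide roughly 15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = model(corrupted)
# The loss is computed over the token vocabulary at the masked positions --
# a target space that only makes sense for text, not for pixels or audio.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```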

“This discrepancy has been a significant barrier to applying advances in self-supervised learning more broadly,” Meta’s AI researchers said. “Because a powerful algorithm designed for, say, understanding images can’t be directly applied to another modality, such as text, it is difficult to push several modalities ahead at the same rate.”

data2vec overcomes this by teaching AI models to predict their own representations of the input data, regardless of its modality. By focusing on those representations instead of the usual words, sounds or visual tokens, data2vec can work with multiple types of input data.
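
Going by that description, a minimal PyTorch sketch of the idea might look like the following: a “teacher” copy of the network encodes the full input, the student encodes a masked version, and the loss regresses the teacher’s continuous representations at the masked positions, after which the teacher tracks the student through an exponential moving average. Every name and hyperparameter here is an illustrative assumption; among other simplifications, the actual model averages several top teacher layers to form targets, where this sketch uses only the final layer.

```python
# Minimal sketch of data2vec-style self-distillation (illustrative, not
# Meta's released code): the student predicts the teacher's *representations*
# of masked positions, so no modality-specific target vocabulary is needed.
import copy
import torch
import torch.nn as nn

DIM = 64
layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
student = nn.TransformerEncoder(layer, num_layers=4)
teacher = copy.deepcopy(student)            # weight-tracking copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)

mask_emb = nn.Parameter(torch.zeros(DIM))   # learned embedding for masked slots

x = torch.randn(8, 32, DIM)                 # any modality, already featurized
mask = torch.rand(8, 32) < 0.5              # positions the student must infer

with torch.no_grad():
    targets = teacher(x)                    # teacher sees the unmasked input
    # (the paper averages the top-K teacher layers; one layer keeps this short)

student_in = torch.where(mask.unsqueeze(-1), mask_emb.expand_as(x), x)
preds = student(student_in)                 # student sees the masked input

# Regress the teacher's continuous representations at masked positions only;
# nothing in this loss is specific to speech, vision or text.
loss = nn.functional.smooth_l1_loss(preds[mask], targets[mask])
loss.backward()

# After each optimizer step, the teacher tracks the student through an
# exponential moving average of its weights (tau close to 1).
tau = 0.999
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1 - tau)
```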

Meta said it tested data2vec on the popular ImageNet computer vision benchmark and found it performed better than the best existing self-supervised methods. For speech, it was superior to Meta’s own wav2vec 2.0 self-supervised speech algorithm, while for text it was tested on the GLUE benchmark suite and performed as well as BERT, a widely used language model.

In a Facebook post, Meta Founder and Chief Executive Mark Zuckerberg described data2vec as one of the company’s most exciting breakthroughs in AI so far.

“Meta AI research built a system that learns from speech, vision and text without needing labeled training data,” Zuckerberg said. “People experience the world through a combination of sight, sound and words, and systems like this could one day understand the world the way we do. This will all eventually get built into AR glasses with an AI assistant so, for example, it could help you cook dinner, noticing if you miss an ingredient, prompting you to turn down the heat, or more complex tasks.”

Elaborating, Meta’s researchers said data2vec has great potential to help create a new breed of AI models that can learn by themselves to perform many different tasks, including unfamiliar ones. An AI would be able to recognize not only the animals it has come across in its training data, but also new creatures if it’s told what they look like.

“This paves the way for more general self-supervised learning and brings us closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread,” Meta’s AI researchers said.

Meta believes the potential of data2vec is so great that it’s sharing the code and various pre-trained models with the wider AI research community so others can build on its work.

Image: Meta AI
