UPDATED 11:40 EDT / FEBRUARY 16 2024

AI

Meta unveils V-JEPA AI model that improves training by learning from video

Meta Platforms Inc.’s AI research division today released a new artificial intelligence model that takes a significant step in AI training: it learns by interpreting video in a way that resembles how humans understand the world.

The model, named V-JEPA, for Video Joint Embedding Predictive Architecture, works differently from large language models. It learns from images rather than words, and it is nongenerative, meaning it doesn’t try to reconstruct a scene pixel by pixel. A generative model would attempt to predict and compare every pixel in every frame, whereas V-JEPA learns from abstract concepts such as trees, people, animals and objects, along with their relationships to one another.
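To make that distinction concrete, here is a minimal toy sketch in PyTorch, with invented shapes and module sizes that are not Meta’s actual architecture: the generative route reconstructs and scores every pixel, while the JEPA-style route makes its prediction in a compact abstract representation space instead.

```python
import torch
import torch.nn as nn

clip = torch.randn(2, 3, 4, 16, 16)        # (batch, channels, frames, height, width)
flat = clip.flatten(1)                      # each tiny clip as one long vector of pixels

# Generative objective: reconstruct and compare every pixel of the clip.
pixel_decoder = nn.Linear(flat.shape[1], flat.shape[1])
pixel_loss = nn.functional.mse_loss(pixel_decoder(flat), flat)

# JEPA-style objective: encode the clip into a compact abstract representation
# and make the prediction in that space, ignoring pixel-level detail entirely.
encoder = nn.Sequential(nn.Linear(flat.shape[1], 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Linear(128, 128)
target_repr = encoder(flat).detach()        # the representation to be predicted
latent_loss = nn.functional.mse_loss(predictor(encoder(flat)), target_repr)

print(pixel_loss.item(), latent_loss.item())
```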

The project was spearheaded by Meta Vice President and Chief AI Scientist Yann LeCun, who proposed the original JEPA model in 2022.

“V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning,” LeCun said. “Our goal is to build advanced machine intelligence that can learn more like humans do, forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”

Because V-JEPA doesn’t need to ingest and analyze every pixel of every frame of a video, the researchers said it can improve training efficiency by a factor of 1.5 to six. It can also be pretrained entirely on unlabeled data; labels are required only afterward, to adapt the model to a particular task. That means it can be pretrained on raw video before anyone has labeled the objects and subjects in the footage, so labeling is no longer a bottleneck.
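A hedged sketch of that two-stage recipe, again with hypothetical module names, sizes and toy data: the pretraining loop never touches a label, and a small labeled set appears only afterward to fit a lightweight task head on the frozen backbone.

```python
import torch
import torch.nn as nn

# Toy backbone and predictor standing in for the video encoder; sizes are invented.
backbone = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Linear(128, 128)
opt = torch.optim.Adam(list(backbone.parameters()) + list(predictor.parameters()), lr=1e-3)

# Stage 1: self-supervised pretraining -- no labels anywhere in this loop.
for _ in range(100):
    clip = torch.randn(8, 1024)                    # stand-in for unlabeled video clips
    context = clip.clone()
    context[:, 512:] = 0.0                         # hide part of each clip from the model
    target = backbone(clip).detach()               # representation of the full clip
    loss = nn.functional.mse_loss(predictor(backbone(context)), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: labels appear only here, to adapt the frozen backbone to one task.
probe = nn.Linear(128, 10)                         # e.g. 10 hypothetical action classes
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
labeled_clips = torch.randn(8, 1024)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    features = backbone(labeled_clips)             # backbone stays frozen
probe_loss = nn.functional.cross_entropy(probe(features), labels)
probe_opt.zero_grad(); probe_loss.backward(); probe_opt.step()
```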

During training, large portions of a video are “masked out,” or hidden from the model, so it has to predict what’s happening in the hidden sections. It’s similar to the way a human infant might learn when a person leaves the field of view to retrieve a ball and then returns. That allows the model to develop a logical understanding of how objects interact.
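The masking idea itself is simple to picture. Below is a toy illustration, with made-up tensor shapes and an arbitrary block size, of hiding one large space-time region of a clip while keeping a boolean mask that records which portion the model must predict.

```python
import torch

video = torch.randn(3, 16, 224, 224)           # (channels, frames, height, width)
mask = torch.zeros(16, 224, 224, dtype=torch.bool)

# Hide one large space-time block: frames 4-11, a 112x112 spatial region.
mask[4:12, 56:168, 56:168] = True
masked_video = video.clone()
masked_video[:, mask] = 0.0                    # the model never sees these values

print(f"{mask.float().mean().item():.1%} of the clip is hidden")  # fraction to predict
```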

The researchers said the model works best with “fine-grained object interactions and distinguishing detailed object-to-object interactions that happen over time.” As an example, the model can tell the difference between someone picking up a pen, putting it down, and picking it up and only pretending to put it down. However, it works well only over short time scales of roughly 10 seconds; extending the horizon over which the model can make predictions is the next step.

Right now, the “V” in V-JEPA stands only for “video,” meaning the model processes only the visual content of videos; it can’t understand the audio spoken in them. The researchers said they’re considering adding audio in the future to give the model more context.

The Meta researchers said that because the model can observe and abstract visual activity, it will open up opportunities for future embodied AI, such as smarter AI agents that behave more like people.

As mixed and augmented reality become more mainstream and AI agents can see what people are doing and appear as lifelike characters in their living rooms, being able to watch someone chop vegetables or interact with a smart TV could make those agents far more useful helpers. For example, if someone cooking a favorite meal picks the wrong ingredient from the cupboard, V-JEPA could quickly notice and help find the right one.

To help researchers and developers build on the model and start developing quickly, the Meta researchers said they’re releasing it under a Creative Commons license so it can be extended. It’s available today in a public GitHub repository.

