UPDATED 12:50 EDT / AUGUST 30 2024

Alibaba announces Qwen2-VL AI model with advanced video analysis and reasoning capabilities

Alibaba Cloud, the cloud computing arm of China’s Alibaba Group Ltd., announced Thursday the release of a new artificial intelligence model named Qwen2-VL that offers advanced visual comprehension and multilingual conversation.

The company, which spent a year developing the new model from its predecessor, Qwen-VL, said it can understand high-quality videos of more than 20 minutes in length.

According to Alibaba, it can summarize video content, answer questions about it and maintain a continuous flow of conversation in real time, which allows it to support live chat. As a result, it can act as a personal assistant, using information drawn directly from video content.

In an example, the model was given a video of what appeared to be a short documentary clip about the International Space Station, including a scene of the control center and a shot of two astronauts speaking from within a capsule while floating in space.

It’s not perfect. When asked to summarize the scene, the model responded with a clear output that described the individuals speaking and the control room, but it also stated that “the men appear to be astronauts, and they are wearing space suits.” The astronauts were not wearing space suits; they appeared to be wearing collared shirts and pants.

When asked what color clothing the astronauts were wearing, the model correctly answered: “The two astronauts are wearing blue and black clothes.” One man is indeed wearing a blue shirt and the other is wearing a black shirt.

The model can serve as a foundation for real-time, text-based live chat, where users talk with the model and it answers questions about a video. It is also capable of function calling and tool use based on vision, enabling it to retrieve and access external data, such as flight statuses, weather forecasts and package tracking. That would make it useful for interacting with customer service or workers in the field who could show it images of products, bar codes or other information.
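To make the idea of function calling concrete, here is a minimal sketch of how an application might route a tool request to its own code. It assumes the model emits its request as a small JSON object with a name and arguments, which is a common convention but not a format confirmed in Alibaba’s announcement. The tool names and their placeholder return values are hypothetical.

```python
import json
from typing import Callable, Dict


# Hypothetical stand-ins for real services. They return placeholder text,
# not real data, and exist only to illustrate the dispatch step.
def get_flight_status(flight_number: str) -> str:
    return f"status lookup for flight {flight_number} would go here"


def track_package(tracking_id: str) -> str:
    return f"tracking lookup for package {tracking_id} would go here"


# Registry mapping tool names the model may request to local functions.
TOOLS: Dict[str, Callable[..., str]] = {
    "get_flight_status": get_flight_status,
    "track_package": track_package,
}


def dispatch(tool_call_json: str) -> str:
    """Parse an assumed JSON tool call and run the matching function."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in TOOLS:
        return f"unknown tool: {name}"
    return TOOLS[name](**call.get("arguments", {}))


# Example request, as if the model had read a flight number from a photo.
example = '{"name": "get_flight_status", "arguments": {"flight_number": "XX123"}}'
print(dispatch(example))
```

In a real deployment, the result would be passed back to the model so it could phrase an answer for the user.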

Alibaba said the model continues the architecture of Qwen-VL, pairing a Vision Transformer model, or ViT, with the Qwen2 language model. The company said it used a ViT with about 600 million parameters to handle both image and video inputs.

The model was enhanced with Naive Dynamic Resolution support, which allows it to handle images of arbitrary resolution, an upgrade over its predecessor. The addition of a Multimodal Rotary Position Embedding system, or M-RoPE, further enables the model to capture 1D textual, 2D visual and 3D video positional information at the same time.
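A rough sketch can illustrate both ideas. It assumes 14-pixel patches merged in 2x2 groups, with each side rounded up to a whole group. Those figures and the function names are illustrative assumptions, not details from Alibaba’s announcement. The first function estimates how many visual tokens an image yields at its own resolution. The other two show how a three-part position (time, height, width) can describe text, images and video in one scheme.

```python
import math
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    """Estimate visual tokens for an image kept at its own resolution.

    Assumes square patches of `patch` pixels, merged in `merge` x `merge`
    groups, with each side rounded up to a whole group (a simplification).
    """
    unit = patch * merge
    return math.ceil(height / unit) * math.ceil(width / unit)


def text_positions(n_tokens: int, start: int = 0) -> List[Position]:
    """Text tokens: all three components share one running index (1D)."""
    return [(start + i, start + i, start + i) for i in range(n_tokens)]


def visual_positions(frames: int, rows: int, cols: int, start: int = 0) -> List[Position]:
    """Visual tokens: a single image has frames=1 (2D); video adds time (3D)."""
    return [
        (start + t, start + r, start + c)
        for t in range(frames)
        for r in range(rows)
        for c in range(cols)
    ]


# Larger images yield more tokens instead of being resized to one fixed shape.
for h, w in [(224, 224), (448, 896), (1080, 1920)]:
    print(f"{h}x{w}: {visual_token_count(h, w)} tokens")

print(text_positions(2))          # two text tokens
print(visual_positions(1, 2, 2))  # one image with a 2x2 token grid
```

Under these assumptions, token count grows with image area, which is the practical effect of dropping a fixed input size.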

Qwen2-VL is available as open source in two sizes, Qwen2-VL-2B and Qwen2-VL-7B, under the highly permissive Apache 2.0 license. The company also released a demo running the 7 billion-parameter model on Hugging Face.

The model does have its limitations, the company noted, as it is unable to extract audio from video files, given that it’s designed only for visual reasoning. Its training data is also only up to date as of June 2023, and it cannot guarantee complete accuracy for complex instructions or scenarios. However, Alibaba said that the model achieved top-tier results on most benchmark metrics, even surpassing closed-source models such as OpenAI’s flagship GPT-4o and Anthropic PBC’s Claude 3.5 Sonnet.

The company said the Qwen2-VL family will be a stepping stone toward stronger vision language models. It plans to integrate more features on the path toward an “omni” model that will be able to reason across both vision and audio.

Image: Pixabay
