UPDATED 21:44 EST / APRIL 18 2024

Microsoft’s new VASA-1 AI framework generates super-realistic talking heads that can even sing songs

Microsoft Corp. has published a research paper that introduces a new kind of artificial intelligence framework that makes it possible to upload a still photo, add a voice sample and create a super-realistic talking head that looks and sounds like the real person.

The new framework is called VASA-1. It takes a single portrait-style image and an audio file and merges them to create a short video of a talking head with realistic facial expressions and head movements, and it can even make the head sing songs in the uploaded voice.

Microsoft said VASA-1 is currently only a research project, so it’s not making it available for anyone else to use, but it posted a number of demonstration videos with dazzling realism.

Although Nvidia Corp. and Runway AI Inc. have both released similar technology, VASA-1 seems to be able to create much more realistic talking heads, with reduced mouth artifacts.

The company said the new framework is designed specifically for animating virtual characters, so all of the individuals in its examples are synthetic, generated using OpenAI’s DALL-E image-generating model. However, it clearly has the potential to go further, because if it’s possible to animate an AI image, it should be just as easy to animate a photo of a real person.

In the demo videos, the talking heads appear to be real individuals who were filmed, with smooth, natural-looking movements. The lip-sync capabilities are especially impressive, and it’s very difficult to discern any unnatural-looking movements.

Equally impressive is that VASA-1 doesn’t seem to require a traditional, face-forward, passport- or portrait-style image to work. The examples include shots of heads facing in slightly different directions. The model also offers a high level of control, accepting inputs such as eye gaze direction, head distance and even emotional expressions, which adds to the realism.

Big potential and big risks

In terms of practical applications, one of the most obvious use cases would be video games. VASA-1 could enable developers to create more realistic AI-generated characters with extremely natural lip-syncing movements and facial expressions, boosting immersion. The technology could also be used to create avatars in social media videos, and perhaps even enable more realistic AI-generated movies or music videos in which it genuinely appears as if the actor, actress or singer is really talking or singing.

Besides its ability to lip-sync talking heads closely with an uploaded song, VASA-1 can also handle non-photographic images such as paintings, including the Mona Lisa rapping the words of Paparazzi:

Microsoft just dropped VASA-1.

This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba

10 wild examples:

1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD

— Min Choi (@minchoi) April 18, 2024

That said, just as there is potential for creativity, there is undoubtedly potential for this technology to be misused. VASA-1 would certainly make life much easier for anyone invested in creating deepfake videos. For instance, someone could upload a headshot of Donald Trump along with a short audio clip of his voice, then create a realistic video of him saying whatever they want him to say.

The risk of misuse explains why Microsoft is being so guarded about the project. “Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications,” Microsoft’s researchers said. “It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans.”

As such, the company said there are no plans to release an online demo, product or additional implementation details at present, adding that it will only consider doing so when it’s certain that the technology will be used responsibly.

Images: Microsoft

