Google researchers develop new diffusion-based AI system for generating videos
Google LLC has revealed Lumiere, an artificial intelligence system for generating videos that the company says can outperform earlier models in the category.
The Alphabet Inc. unit detailed the system in a research paper published on Tuesday. According to Google, Lumiere is capable of generating five-second videos with a resolution of 1024 by 1024 pixels. It can create clips based on a text prompt or image provided by the user, as well as edit existing clips.
Under the hood, Lumiere comprises two separate AI models. The first, which Google’s researchers have named Space-Time U-Net, generates an initial low-resolution clip based on the user’s prompt. The second model upgrades the resolution of that clip to produce the final 1024-by-1024-pixel version Lumiere provides as output.
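The shape of that two-stage pipeline can be illustrated with a toy sketch. The stand-in below uses simple nearest-neighbor upsampling in place of Lumiere’s learned super-resolution model, and the function names are invented for illustration; it only shows how a low-resolution clip flows into a resolution-upgrading second stage.

```python
import numpy as np

def generate_low_res_clip(num_frames, height, width, seed=0):
    """Stand-in for the first stage: in reality a diffusion model
    generates these frames from the user's prompt. Here we just
    produce random frames of the right shape."""
    rng = np.random.default_rng(seed)
    return rng.random((num_frames, height, width))

def upsample_clip(frames, factor):
    """Stand-in for the second stage: nearest-neighbor spatial
    upsampling. Lumiere's real second stage is a learned
    super-resolution model, not this."""
    return frames.repeat(factor, axis=1).repeat(factor, axis=2)

# Toy pipeline: small clip in, higher-resolution clip out.
low_res = generate_low_res_clip(num_frames=5, height=128, width=128)
final = upsample_clip(low_res, factor=8)  # 128 * 8 = 1024 per side
```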
According to Google, the Space-Time U-Net model that creates the initial low-resolution video is based on a so-called diffusion architecture. This architecture underpins many of the most popular AI image generators on the market.
What sets diffusion models apart from other neural networks is the way they’re trained. During training, a diffusion model is given a set of photos that contain a type of error known as Gaussian noise. It must then remove the error to recreate the original photos, a process through which it learns how to generate entirely new images from scratch.
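That training setup can be sketched in a few lines. The snippet below shows the standard forward noising step and the denoising objective used by diffusion models generally; it is a minimal illustration, not Google’s code, and the function names are invented for clarity.

```python
import numpy as np

def add_gaussian_noise(x0, alpha_bar_t, noise):
    """Forward diffusion step: blend a clean image x0 with Gaussian
    noise. alpha_bar_t close to 1 keeps the image mostly intact;
    close to 0 leaves almost pure noise."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

def denoising_loss(predicted_noise, true_noise):
    """Training objective: mean squared error between the noise the
    model predicts and the noise that was actually added. Minimizing
    this teaches the model to undo the corruption."""
    return float(np.mean((predicted_noise - true_noise) ** 2))
```

At generation time, the trained model runs this process in reverse: starting from pure noise, it removes a little of the predicted noise at each step until a new image emerges.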
Google’s researchers based Space-Time U-Net on an existing, open-source diffusion model. They customized it by adding software components that can lower and raise the resolution of the footage it processes. They also equipped the model with support for attention, a machine learning technique that allows neural networks to filter the information they consider when making decisions and disregard irrelevant data points.
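The filtering behavior of attention comes from a softmax-weighted lookup. The sketch below implements standard scaled dot-product attention in numpy as a generic illustration of the technique; it is not Lumiere’s implementation, whose attention layers operate over space and time inside the U-Net.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores every key; the scores are turned into weights
    that sum to 1. Near-zero weights mean the corresponding values are
    effectively disregarded as irrelevant."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                   # weighted mix of values
```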
Space-Time U-Net is not the only AI model capable of generating videos. However, it takes a different approach to the task than earlier neural networks in the category.
A video is a set of images, or frames, displayed one after another. AI systems typically generate that sequence of frames in two phases. They create the first and last images in the sequence, then fill in the remaining frames between them.
Space-Time U-Net goes about the task differently. Instead of generating a clip’s frames piecemeal, it creates the entire clip in one pass. Google says this approach allows Lumiere to generate higher-fidelity videos than many existing neural networks.
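The contrast between the two approaches can be shown with a toy frame-scheduling sketch. This is only an illustration of which frames each approach commits to at each stage, with invented function names; it does not reproduce either pipeline.

```python
def keyframe_then_interpolate(num_frames, stride):
    """Conventional two-phase pipeline: generate sparse anchor frames
    first, then fill the gaps in a second pass. Motion between anchors
    is never seen by the model all at once."""
    anchors = list(range(0, num_frames, stride))
    fill = [i for i in range(num_frames) if i not in anchors]
    return anchors, fill

def one_pass(num_frames):
    """Space-Time U-Net-style scheduling: the full temporal extent of
    the clip is generated jointly in a single pass."""
    return list(range(num_frames))
```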
The company evaluated the AI system’s capabilities by having it generate a series of five-second clips based on 113 different prompts. For good measure, it also used a benchmark dataset called UCF101 in the assessment. The researchers determined that Lumiere achieved “state-of-the-art video generation results” compared with other neural networks designed for the same task.
Besides generating video content based on text prompts, Lumiere can also create clips in the style of a reference image uploaded by the user. It’s likewise capable of modifying existing clips. Furthermore, Google says, Lumiere can create animations called cinemagraphs in which only some elements move while the others remain still.
Image: Google