Meta announces new breakthroughs in AI image editing and video generation with Emu
Artificial intelligence researchers from Meta Platforms Inc. said today they have made significant advances in AI-powered image and video generation.
The Facebook and Instagram parent has developed new tools that enable more control over the image editing process via text instructions, as well as a new method for text-to-video generation. The new tools are based on Meta’s Expressive Media Universe, or Emu, the company’s first foundational model for image generation.
Emu was announced in September and is now being used in production, powering experiences such as Meta AI’s Imagine feature, which allows users to generate photorealistic images in Messenger. In a blog post, Meta’s AI researchers explained that AI image generation is often a step-by-step process: the user tries a prompt, and the picture that’s generated isn’t quite what they had in mind. As a result, users are forced to keep tweaking the prompt until the image created is closer to what they had imagined.
Emu Edit for image editing
What Meta wants to do is eliminate this trial-and-error process and give users more precise control, and that’s what its new Emu Edit tool is all about. It offers a novel approach to image manipulation in which the user simply inputs text-based instructions. It can perform local and global editing, add or remove backgrounds, apply color and geometry transformations, and handle object detection, segmentation and many other editing tasks.
“Current methods often lean toward either over-modifying or under-performing on various editing tasks,” the researchers wrote. “We argue that the primary objective shouldn’t just be about producing a ‘believable’ image. Instead, the model should focus on precisely altering only the pixels relevant to the edit request.”
To that end, Emu Edit has been designed to follow the user’s instructions precisely, ensuring that pixels unrelated to the request are left untouched by the edit. For example, if a user wants to add the text “Aloha!” to a picture of a baseball cap, the cap itself should not be altered.
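The researchers frame this as a measurable property, not just a design goal. As a rough illustration only (this is not Meta’s evaluation metric), one way to quantify off-target edits is to measure how much the image changed outside the region an instruction targets. The sketch below assumes grayscale NumPy arrays and a hand-supplied edit mask:

```python
import numpy as np

def off_target_change(before: np.ndarray, after: np.ndarray,
                      edit_mask: np.ndarray) -> float:
    """Mean absolute pixel difference outside the edited region;
    lower is better if the editor truly leaves unrelated pixels alone."""
    outside = ~edit_mask
    diff = np.abs(after.astype(np.int16) - before.astype(np.int16))
    return float(diff[outside].mean())

# Toy check: edit only a 20x20 patch, then verify the rest is unchanged.
before = np.zeros((64, 64), dtype=np.uint8)
after = before.copy()
mask = np.zeros_like(before, dtype=bool)
mask[10:30, 10:30] = True
after[mask] = 255                              # the "edit"
print(off_target_change(before, after, mask))  # 0.0 -- nothing else moved
```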
The researchers said that incorporating computer vision tasks as instructions for image generation models gives users unprecedented control in image editing.
Emu Edit was trained on a dataset of 10 million synthesized samples, each consisting of an input image, a description of the task to be performed and the target output image. The researchers believe this is the largest dataset of its kind ever created, and that it allows Emu Edit to deliver unrivaled results in terms of instruction faithfulness and image quality.
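Meta hasn’t published the exact data schema, but each sample pairs an input image and a task instruction with a ground-truth result. A minimal sketch of such a triplet, with hypothetical field names, might look like this in Python:

```python
from dataclasses import dataclass

@dataclass
class EditSample:
    """Hypothetical layout of one Emu Edit training sample; the field
    names here are illustrative assumptions, not Meta's actual schema."""
    input_image: bytes   # source image before editing
    task: str            # editing task category, e.g. "add_text"
    instruction: str     # natural-language edit request
    target_image: bytes  # ground-truth image after the edit

sample = EditSample(
    input_image=b"<png bytes>",   # placeholder for real image data
    task="add_text",
    instruction='Add the text "Aloha!" to the baseball cap',
    target_image=b"<png bytes>",  # placeholder for real image data
)
```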
Emu Video for video generation
Meta’s AI team has also been focused on enhancing video generation. The researchers explained that using generative AI to create videos is similar to image generation, only it also involves bringing those images to life by adding movement.
The Emu Video tool leverages the Emu model and provides a simple method for text-to-video generation that’s based on diffusion models. Meta said the tool can respond to various inputs, including text only, image only or both together.
The video generation process is split into two steps: the first creates an image conditioned on a text prompt, and the second creates a video conditioned on both that image and the text prompt. According to the team, this “factorized” approach offers an extremely efficient way to train video generation models.
“We show that factorized video generation can be implemented via a single diffusion model,” the researchers wrote. “We present critical design decisions, like adjusting noise schedules for video diffusion, and multi-stage training that allows us to directly generate higher-resolution videos.”
Meta said the advantage of this new approach is that it’s simpler to implement, using just a pair of diffusion models to whip up a 512-by-512, four-second video at 16 frames per second, compared with its older Make-A-Video tool, which uses five models. The company says human evaluations reveal that this work is “strongly preferred” over its earlier work in video generation in terms of overall quality and faithfulness to the original text prompt.
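Put together, the pipeline is a two-stage cascade. The sketch below is a hedged outline of that factorization with stand-in tensors in place of real denoising loops; the function names and shapes are illustrative assumptions, not Meta’s code:

```python
import torch

def text_to_image(prompt: str) -> torch.Tensor:
    """Stage 1: a diffusion model samples a 512x512 keyframe from text.
    (Stand-in: a random tensor in place of an actual denoising loop.)"""
    return torch.rand(3, 512, 512)

def image_text_to_video(keyframe: torch.Tensor, prompt: str) -> torch.Tensor:
    """Stage 2: a second diffusion model samples 64 frames (4 s at 16 fps),
    conditioned on both the keyframe and the text prompt."""
    frames = keyframe.unsqueeze(0).repeat(64, 1, 1, 1)  # (frames, C, H, W)
    return frames  # stand-in for the denoised video sample

prompt = "a corgi surfing a wave at sunset"
video = image_text_to_video(text_to_image(prompt), prompt)
print(video.shape)  # torch.Size([64, 3, 512, 512])
```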
Emu Video boasts other capabilities too, including the ability to animate users’ images based on simple text prompts, and once again it outperforms Meta’s earlier work.
For now, Meta’s research into generative AI image editing and video generation remains ongoing, but the team stressed there are a number of exciting use cases for the technology. For instance, it can enable users to create their own animated stickers and GIFs on the fly, rather than searching for existing ones that match the idea they’re trying to convey. It can also enable people to edit their own photographs without using complicated tools such as Photoshop.
The company added that its latest models are unlikely to replace professional artists and animators anytime soon. Instead, their potential lies in helping people to express themselves in new ways.
Images: Meta AI