EMO: Alibaba's new AI tool breathes life into portraits
Just days after OpenAI surprised the world with its text-to-video tool called “Sora,” a new AI tool called “EMO” (Emote Portrait Alive) has appeared. Researchers at China’s Alibaba have unveiled a new generative AI tool that pushes the limits of realism in animating subjects.
The system seamlessly animates a lifelike video from a single photo, where the subject can speak or even sing, all controlled by an audio input. The solution, detailed in a research paper published on arXiv, represents a leap forward in the production of voice-controlled video, a long-standing challenge for AI experts. EMO outperforms traditional methods, which often struggle to reconstruct the complexity and nuances of human expressions.
“Capturing the full spectrum of human facial expressions and the uniqueness of individual facial styles has always been an obstacle. EMO solves this with a novel approach to audio-to-video synthesis, eliminating the need for complex 3D models or facial landmarks.”
said Linrui Tian, lead author of the study.
6. Audrey Hepburn singing Ed Sheeran cover pic.twitter.com/QQFPVUg5zK
— Min Choi (@minchoi) February 28, 2024
Behind the magic of artificial intelligence
How does EMO do all this? At its core, the application uses a popular technique known as the diffusion model, already well known for producing highly realistic synthetic images. Alibaba researchers trained this model on a huge dataset of more than 250 hours of curated video drawn from various sources, including speeches, films, television, and musical performances.
While previous methods rely heavily on 3D modeling or blendshapes to simulate facial movement, EMO takes a more direct approach. It converts sound waves directly into video frames, resulting in remarkably natural animations that capture each person’s subtle mannerisms and individual quirks.
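To make the idea concrete, here is a minimal, purely illustrative sketch of how an audio-conditioned diffusion model generates a frame: starting from random noise, it repeatedly subtracts predicted noise while an audio embedding steers each step. The `toy_denoiser` function, the array sizes, and the noise schedule are all hypothetical stand-ins; EMO's actual denoiser is a large learned neural network described in the arXiv paper, not this linear toy.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)   # toy noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def toy_denoiser(noisy_frame, audio_feat, t):
    """Hypothetical stand-in for a learned noise predictor.

    A real system would run a neural network conditioned on audio
    features; here we just mix the inputs to show the data flow.
    """
    return 0.9 * noisy_frame + 0.1 * audio_feat * (t / T)

audio_feat = rng.normal(size=(8, 8))  # pretend per-frame audio embedding
frame = rng.normal(size=(8, 8))       # start from pure noise

# Standard DDPM-style reverse process: denoise step by step,
# re-injecting a little noise at every step except the last.
for t in reversed(range(T)):
    eps_hat = toy_denoiser(frame, audio_feat, t)       # predicted noise
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    frame = (frame - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        frame += np.sqrt(betas[t]) * rng.normal(size=frame.shape)

print(frame.shape)  # one generated frame, shaped by the audio condition
```

In a full pipeline this loop would run once per video frame, with the audio embedding advancing along the soundtrack so that mouth shapes and expressions track the speech or singing.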
From talking heads to singing portraits
Rigorous tests described in the research paper show that EMO significantly outperforms current best-in-class systems in terms of video quality, identity preservation, and expressiveness, and its abilities don’t stop at speech. It animates portraits into lively singing videos, synchronizing mouth movements and facial expressions to the soundtrack. The system can generate videos of arbitrary length, which makes it extremely versatile.
This is mind blowing.
This AI can make single image sing, talk, and rap from any audio file expressively! 🤯
Introducing EMO: Emote Portrait Alive by Alibaba.
10 wild examples: 🧵👇
1. AI Lady from Sora singing Dua Lipa pic.twitter.com/CWFJF9vy1M
— Min Choi (@minchoi) February 28, 2024
“EMO proves that it can produce convincing speech videos and even generate vocal videos in different styles. This is a significant step forward.”
the researchers write in the study.
Although similar concepts and tools have existed before, the results EMO produces are remarkable. Social media platforms could embrace this capability, letting users turn photos and audio samples into highly personalized videos. Further possibilities include bringing historical figures to life and other creative applications in arts and entertainment.
As for the worrying part, the use of such advanced technology reasonably raises ethical challenges. If misused, they can create deepfakes that can spread false information, impersonate individuals without their consent, and create further mistrust. Researchers are already investigating detection methods to combat this, emphasizing the need for responsible use of such high-powered artificial intelligence systems.
Tools like EMO underscore the urgent need to develop methods to distinguish these synthetic videos from the real thing.
(source)