Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
This work addresses the challenge of creating natural, real-time talking head animations for applications like virtual avatars or video conferencing, though it appears incremental in that it builds on existing motion generation techniques.
The authors tackled the problem of generating realistic, real-time audio-driven portrait animations by introducing Teller, an autoregressive framework that decomposes facial and body motion into components for efficient generation. The result is a method that achieves up to 25 FPS streaming performance, runs significantly faster than diffusion-based models (0.92 s for Teller vs. 20.93 s for Hallo to generate one second of video), and outperforms recent models in human evaluations of quality and realism.
In this work, we introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. Beyond the challenge of lengthy animation times, a critical challenge in realistic talking head generation lies in preserving the natural movement of diverse body parts. To this end, we propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation. Specifically, Teller first decomposes facial and body detail animation into two components: Facial Motion Latent Generation (FMLG) based on an autoregressive transformer, and movement authenticity refinement using an Efficient Temporal Module (ETM). Concretely, FMLG employs a Residual VQ model to map the facial motion latent from the implicit keypoint-based model into discrete motion tokens, which are then temporally sliced with audio embeddings. This enables the AR transformer to learn real-time, stream-based mappings from audio to motion. Furthermore, Teller incorporates ETM to capture finer motion details. This module ensures the physical consistency of body parts and accessories, such as neck muscles and earrings, improving the realism of these movements. Teller is designed to be efficient, surpassing the inference speed of diffusion-based models (20.93 s for Hallo vs. 0.92 s for Teller to generate one second of video), and achieves a real-time streaming performance of up to 25 FPS. Extensive experiments demonstrate that our method outperforms recent audio-driven portrait animation models, especially in small movements, as validated by human evaluations with a significant margin in quality and realism.
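To make the Residual VQ step mentioned above more concrete, the following is a minimal sketch of generic residual vector quantization, not the authors' implementation. Each stage picks the codeword nearest to the current residual, emits its index as a discrete token, and passes the remaining error to the next stage. The function names (`rvq_encode`, `rvq_decode`), the two tiny hand-written codebooks, and the 2-D "motion latent" are all illustrative assumptions. In Teller the codebooks would be learned and the latent would come from the implicit keypoint-based model.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest_index(codebook: Codebook, v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    def sqdist(c: Vector) -> float:
        return sum((ci - vi) ** 2 for ci, vi in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(x)
    tokens = []
    for cb in codebooks:
        idx = nearest_index(cb, residual)
        tokens.append(idx)
        # Subtract the chosen codeword so the next stage only sees the remaining error.
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the latent as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


# Hypothetical codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.5], [0.0, -0.5], [0.5, 0.0]],
]

if __name__ == "__main__":
    latent = [1.0, 0.6]  # stand-in for a facial motion latent
    tokens = rvq_encode(latent, CODEBOOKS)
    print(tokens, rvq_decode(tokens, CODEBOOKS))
```

In this toy setup, the token list per frame is the kind of discrete sequence an autoregressive transformer could be trained to predict from audio embeddings. How Teller orders or interleaves those tokens with audio is not specified in the abstract and is not modeled here.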