CV | Jun 27, 2025

MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation

arXiv:2506.22065v1 | 3 citations | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating realistic, audio-synchronized animations for applications such as virtual avatars, offering incremental improvements in efficiency and control.

The paper tackles the problem of real-time, high-fidelity audio-driven portrait animation by introducing MirrorMe, a framework that achieves state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability on the EMTD Benchmark.

Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into the denoising process, they rely on frame-by-frame UNet architectures, which introduce prohibitive latency and struggle with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent-space denoising. To address LTX's trade-offs between compression and semantic fidelity, we propose three innovations: (1) a reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; (2) a causal audio encoder and adapter tailored to LTX's temporal structure, enabling precise audio-expression synchronization; and (3) a progressive training strategy combining close-up facial training, half-body synthesis with facial masking, and hand pose integration for enhanced gesture control. Extensive experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
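To make the word "causal" in the audio encoder concrete, here is a minimal sketch of a causal, strided 1D convolution over per-frame audio features. It is not MirrorMe's encoder: the function name `causal_conv1d`, the scalar features, the kernel weights, and the stride are all assumptions chosen for illustration. The only property it demonstrates is that each output depends on the current and earlier audio frames, never on future ones, while the stride reduces the temporal resolution, as a temporally compressed latent would require.

```python
from typing import List


def causal_conv1d(frames: List[float], kernel: List[float], stride: int = 1) -> List[float]:
    """Hypothetical helper: causal 1D convolution over a sequence of scalar features.

    The output computed at input position t uses only frames[t-k+1 .. t],
    where k = len(kernel). Positions before the start are treated as zeros.
    """
    k = len(kernel)
    # Left-pad with k-1 zeros so the first output sees only the first frame.
    padded = [0.0] * (k - 1) + list(frames)
    out = []
    # A stride > 1 downsamples in time (assumed here to mimic temporal compression).
    for t in range(0, len(frames), stride):
        window = padded[t : t + k]  # corresponds to frames[t-k+1 .. t]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


if __name__ == "__main__":
    audio = [1.0, 2.0, 3.0, 4.0]
    print(causal_conv1d(audio, [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```

Changing the last input frame changes only the last output, which is the property that lets such an encoder run on streaming audio. A real encoder would operate on feature vectors with learned weights rather than on scalars with a fixed kernel.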

