Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching
This work provides a real-time solution for high-fidelity avatar generation, making such avatars accessible to broader applications, although the contribution appears incremental because it refines existing flow matching methods.
The authors tackle audio-driven talking-head video generation in real time by addressing limited lip-sync accuracy and long-term pose drift, achieving a LipSync Confidence of 8.50 on the HDTF dataset and a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU.
We present Livatar, a real-time audio-driven talking-head video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow-matching-based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality, with an 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at https://www.hedra.com/, with examples at https://h-liu1997.github.io/Livatar-1/
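
To put the reported throughput and latency in perspective, the following is a minimal arithmetic sketch, not part of the authors' system. It uses only the two figures stated above (141 FPS and 0.17s). The playback rate of 25 FPS is an assumption made for illustration, and the function names are hypothetical.

```python
# Figures reported in the abstract.
THROUGHPUT_FPS = 141.0  # frames generated per second on a single A10 GPU
LATENCY_S = 0.17        # end-to-end latency in seconds

# Assumption (not stated in the source): video is played back at 25 FPS.
PLAYBACK_FPS = 25.0


def per_frame_budget_ms(throughput_fps: float) -> float:
    """Average wall-clock time per generated frame, in milliseconds."""
    return 1000.0 / throughput_fps


def real_time_factor(throughput_fps: float, playback_fps: float) -> float:
    """Ratio of generation speed to playback speed; above 1 means faster than real time."""
    return throughput_fps / playback_fps


def frames_spanned_by_latency(latency_s: float, playback_fps: float) -> float:
    """Number of playback frames that elapse during the end-to-end latency."""
    return latency_s * playback_fps


if __name__ == "__main__":
    print(f"per-frame budget: {per_frame_budget_ms(THROUGHPUT_FPS):.2f} ms")
    print(f"real-time factor: {real_time_factor(THROUGHPUT_FPS, PLAYBACK_FPS):.2f}x")
    print(f"latency in frames: {frames_spanned_by_latency(LATENCY_S, PLAYBACK_FPS):.2f}")
```

Under the assumed 25 FPS playback rate, the generator runs several times faster than playback, which leaves headroom for other stages of a pipeline. A different playback rate changes the last two numbers but not the per-frame budget.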