CVNov 24, 2024

Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance

Haojie Zhang, Zhihao Liang, Ruibo Fu, Bingyan Liu, Zhengqi Wen, Xuefei Liu, Jianhua Tao, Yaling Liang

arXiv:2411.16748v45.21 citationsh-index: 25

Originality Incremental advance

AI Analysis

This addresses the challenge of creating realistic, consistent talking videos for applications like virtual avatars or video editing, though it appears incremental as it builds on existing diffusion and transformer methods.

The paper tackles the problem of synthesizing long-duration talking videos, which suffer from visual degradation and inconsistency over time, by proposing LetsTalk, a diffusion transformer framework with multimodal guidance and a memory bank mechanism. The result is state-of-the-art generation quality with 8x fewer parameters than previous approaches.

Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait and temporal consistency, and computational efficiency. As video length increases, issues such as visual degradation, identity inconsistency, temporal incoherence, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal consistency, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.

View on arXiv PDF

Similar