CV | Oct 31, 2024

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Xiang Deng, Youxin Pang, Xiaochen Zhao, Chao Xu, Lizhen Wang, Hongjiang Xiao, Shi Yan, Hongwen Zhang, Yebin Liu

arXiv:2410.23836v17.64 citations | h-index: 31 | IEEE Trans Pattern Anal Mach Intell

Originality: Incremental advance

AI Analysis

The paper addresses the problem of creating realistic, controllable talking-human videos for applications such as virtual avatars and entertainment, though it builds incrementally on existing audio-driven synthesis methods.

The paper tackles audio-driven 3D human video synthesis with Stereo-Talker, a system that generates talking videos with precise lip synchronization, expressive gestures, and viewpoint control. It uses a two-stage approach that combines LLM priors with a Mixture-of-Experts mechanism, and it introduces a dataset of 2,203 identities to support broad generalization.

This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging LLMs' cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, enhancing the stability and accuracy of masks and enabling mask guiding during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.
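The abstract does not say how the two prior-guided MoE modules route features, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes soft gating by camera yaw for the view-guided case and hard per-region routing for the mask-guided case. The function names, the anchor angles, and the `temperature` parameter are all hypothetical.

```python
import math


def view_gate(view_deg, anchor_degs, temperature=30.0):
    """Soft expert weights from a view prior (assumed: camera yaw in degrees).

    Experts whose hypothetical anchor angle is closer to the view get more weight.
    """
    # Angular distance wrapped into [0, 180].
    dists = [abs((view_deg - a + 180.0) % 360.0 - 180.0) for a in anchor_degs]
    logits = [-d / temperature for d in dists]
    # Numerically stable softmax over the logits.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def view_guided_moe(feature, view_deg, experts, anchor_degs):
    """Blend expert outputs using weights set by the view prior, not by the feature."""
    weights = view_gate(view_deg, anchor_degs)
    outputs = [expert(feature) for expert in experts]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(feature))
    ]


def mask_guided_moe(features, mask, region_experts):
    """Hard routing: each position goes to the expert assigned to its mask label."""
    return [region_experts[label](f) for f, label in zip(features, mask)]
```

In both functions the routing decision comes from an external prior (a view angle or a region mask) rather than from a gate learned on the features alone, which is the property the abstract's "prior-guided" wording suggests. A real diffusion backbone would apply this to tensors inside network layers rather than to Python lists.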
