CV | Oct 31, 2024

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

arXiv:2410.23836v1 | 4 citations | h-index: 31 | IEEE Trans Pattern Anal Mach Intell
Originality: Incremental advance
AI Analysis

This paper addresses the problem of creating realistic and controllable talking human videos for applications like virtual avatars or entertainment, though it builds incrementally on existing audio-driven synthesis methods.

The paper tackles audio-driven 3D human video synthesis by introducing Stereo-Talker, a system that generates talking videos with precise lip sync, expressive gestures, and viewpoint control. It uses a two-stage approach with LLM priors and a Mixture-of-Experts mechanism, and it includes a dataset of 2,203 identities for broad generalization.

This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging LLMs' cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, enhancing the stability and accuracy of masks and enabling mask guiding during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.
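The abstract does not say how the prior-guided experts are routed, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that a view-guided MoE blends experts using weights computed from the camera yaw (a prior) instead of a learned gate, and that a mask-guided MoE assigns each token to experts according to per-region mask weights. All names here (`view_gate`, `view_guided_moe`, `mask_guided_moe`, the anchor yaws, the temperature) are hypothetical, and each expert is reduced to a single linear map for brevity.

```python
import numpy as np

def view_gate(yaw_deg, expert_yaws_deg, temperature=30.0):
    """Routing weights from a view prior (assumed scheme): experts whose
    hypothetical anchor yaw is closer to the camera yaw get more weight."""
    # Wrapped angular distance in degrees, in [0, 180].
    diff = np.abs((np.asarray(expert_yaws_deg, dtype=float) - yaw_deg + 180.0) % 360.0 - 180.0)
    logits = -diff / temperature
    logits -= logits.max()          # numerical stability for the softmax
    w = np.exp(logits)
    return w / w.sum()

def expert_outputs(x, expert_weights):
    """Apply every (linear) expert to every token.
    x: (tokens, dim); expert_weights: (n_experts, dim, dim)
    returns: (n_experts, tokens, dim)"""
    return np.einsum("td,edk->etk", x, expert_weights)

def view_guided_moe(x, expert_weights, gate):
    """Blend expert outputs with one weight per expert (shared by all tokens)."""
    return np.einsum("e,etk->tk", gate, expert_outputs(x, expert_weights))

def mask_guided_moe(x, expert_weights, region_masks):
    """Blend expert outputs with per-token weights taken from region masks.
    region_masks: (n_experts, tokens), summing to 1 over experts per token."""
    return np.einsum("et,etk->tk", region_masks, expert_outputs(x, expert_weights))

# Toy data: 6 tokens of dimension 4, 3 experts.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
experts = rng.normal(size=(3, 4, 4))

# View-guided routing: camera at yaw 10 degrees, experts anchored at -90, 0, 90.
gate = view_gate(10.0, [-90.0, 0.0, 90.0])
y_view = view_guided_moe(x, experts, gate)

# Mask-guided routing: tokens 0-1, 2-3, 4-5 belong to three hypothetical regions.
masks = np.zeros((3, 6))
masks[0, :2] = 1.0
masks[1, 2:4] = 1.0
masks[2, 4:] = 1.0
y_mask = mask_guided_moe(x, experts, masks)
```

In this sketch the gate depends only on the prior (view angle or mask), which is what distinguishes it from a standard MoE whose router is learned from the token features themselves. With hard masks, each token is processed by exactly one expert; soft masks would blend experts near region boundaries.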

