CV · AI · Sep 3, 2024

CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

arXiv:2409.01876v3 · 38 citations · h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses the underexplored challenge of audio-driven human animation for video generation, though the advance appears incremental because it builds on existing diffusion-based video generation technology.

The paper tackles the problem of cross-modality human body animation by introducing CyberHost, an end-to-end audio-driven framework that ensures hand integrity, identity consistency, and natural motion, surpassing previous works in quantitative and qualitative aspects.

Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of zero-shot video generation within the scope of the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.

