Audio-driven Gesture Generation via Deviation Feature in the Latent Space
This work addresses the problem of generating realistic co-speech gestures for video production, offering an incremental advance through weakly supervised learning.
The paper tackles co-speech gesture video generation with a weakly supervised framework that uses a diffusion model to learn deviations in the latent representation; the authors report significant improvements in video quality over state-of-the-art methods.
Gestures are essential to spoken communication: they add visual emphasis and complement verbal interaction. Prior work has concentrated on point-level motion or on fully supervised, data-driven methods. We instead focus on co-speech gesture video and advocate weakly supervised learning of pixel-level motion deviations. We introduce a weakly supervised framework that learns deviations in the latent representation, tailored to co-speech gesture video generation. A diffusion model integrates the latent motion features, enabling more precise and nuanced gesture representation. By leveraging these weakly supervised deviations in latent space, our method generates both hand gestures and mouth movements, which are crucial for realistic video production. Experiments show that our method significantly improves video quality and surpasses current state-of-the-art techniques.
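The abstract does not give the model's equations, so the following is only an illustrative sketch of how a diffusion model could operate on a latent deviation feature. It assumes, hypothetically, that the deviation is the element-wise difference between a target-frame latent and a reference-frame latent, and it uses the standard DDPM forward (noising) process with a linear beta schedule. The function names, the schedule, and the definition of the deviation are all assumptions for illustration, not the authors' implementation.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default; assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def latent_deviation(frame_latent, reference_latent):
    """Hypothetical deviation feature: target latent minus reference latent."""
    return [f - r for f, r in zip(frame_latent, reference_latent)]


def add_noise(deviation, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0), where x_0 is the deviation.

    Returns the noised vector and the Gaussian noise that was added; a
    denoising network would be trained to predict that noise, typically
    conditioned on audio features (the conditioning is not shown here).
    """
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in deviation]
    noisy = [signal * d + noise_scale * e for d, e in zip(deviation, eps)]
    return noisy, eps
```

In a training loop one would draw a random timestep `t`, call `add_noise` on the deviation of each frame, and regress the network output against `eps`. Whether the paper follows this exact noise-prediction recipe is not stated in the abstract.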