EMoG: Synthesizing Emotive Co-speech 3D Gesture with Diffusion Model
This work improves co-speech gesture generation for applications such as virtual avatars and animation, but it is incremental: it builds on existing diffusion models, adding emotion guidance and a joint-correlation-aware transformer.
The paper tackles the problem of generating diverse and realistic co-speech 3D gestures by addressing the one-to-many mapping between speech and gestures and by modeling correlations between body joints. The authors report that the resulting method surpasses previous state-of-the-art approaches in gesture synthesis.
Although previous co-speech gesture generation methods can synthesize motions in line with speech content, they still struggle to handle diverse and complicated motion distributions. The key challenges are: 1) the one-to-many mapping between speech content and gestures; 2) modeling the correlation between body joints. In this paper, we present a novel framework (EMoG) that tackles these challenges with denoising diffusion models: 1) To alleviate the one-to-many problem, we incorporate emotion clues to guide the generation process, which narrows the set of plausible gestures for a given utterance; 2) To model joint correlation, we propose to decompose the difficult gesture generation problem into two sub-problems: joint correlation modeling and temporal dynamics modeling. The two sub-problems are then explicitly tackled with our proposed Joint Correlation-aware transFormer (JCFormer). Through extensive evaluations, we demonstrate that our proposed method surpasses previous state-of-the-art approaches in gesture synthesis.
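The decomposition into joint correlation modeling and temporal dynamics modeling can be illustrated with a small sketch of attention factorized along two axes. This is not the authors' JCFormer, whose architecture is not described in this summary. It is a minimal, unparameterized example under stated assumptions: the gesture tensor is laid out as [frames][joints][features], attention uses identity query/key/value projections, joint attention is applied before temporal attention, and the function names `self_attention` and `joint_then_temporal` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: identity Q/K/V projections (no learned weights), so each
    output vector is a weighted average of the input vectors.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def joint_then_temporal(motion):
    """Apply attention across joints, then across time.

    motion: nested list indexed as motion[frame][joint][feature]
    (hypothetical layout chosen for this sketch).
    """
    # Sub-problem 1: joint correlation. Within each frame, joints attend
    # to one another.
    mixed = [self_attention(frame) for frame in motion]

    # Sub-problem 2: temporal dynamics. For each joint, its trajectory
    # across frames attends to itself.
    num_frames, num_joints = len(mixed), len(mixed[0])
    per_joint = [
        self_attention([mixed[t][j] for t in range(num_frames)])
        for j in range(num_joints)
    ]

    # Restore the [frame][joint][feature] layout.
    return [[per_joint[j][t] for j in range(num_joints)] for t in range(num_frames)]


if __name__ == "__main__":
    # Toy input: 4 frames, 3 joints, 2 features per joint.
    toy = [[[float(t + j + d) for d in range(2)] for j in range(3)] for t in range(4)]
    out = joint_then_temporal(toy)
    print(len(out), len(out[0]), len(out[0][0]))
```

Factorizing attention this way keeps each attention call small: the joint step compares only joints within a frame and the temporal step compares only frames for one joint, instead of comparing every (frame, joint) pair with every other. In a diffusion setting, a block of this kind would sit inside the denoising network and would additionally receive speech and emotion conditioning, which this sketch omits.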