3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control

Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Naoya Chiba, Yuki Uranishi

arXiv:2601.18451

Originality: Incremental advance

AI Analysis

This work improves gesture generation for applications such as virtual avatars and human-computer interaction, though it appears incremental: it builds on existing methods, combining them in a new framework.

The paper tackles the problem of generating holistic co-speech gestures that integrate full-body motion with facial expressions, addressing issues like semantic incoherence and spatial instability, and demonstrates effectiveness on the BEAT2 dataset with state-of-the-art results.

Generating holistic co-speech gestures that integrate full-body motion with facial expressions suffers from semantically incoherent coordination of body motion and from spatially unstable, meaningless movements, both caused by existing part-decomposed or frame-level regression methods. We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem through diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and ensures both spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that deeply integrates and refines multi-modal signals, ensuring structured and fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of 3DGesPolicy over other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.
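
The idea of treating frame-to-frame variations as actions can be illustrated with a minimal sketch. It assumes, purely for illustration, that each frame is a flat vector of pose parameters and that an action is the element-wise difference between consecutive frames; the abstract does not specify the paper's actual pose representation or action definition, and the function names `poses_to_actions` and `rollout` are hypothetical. The diffusion policy that would predict the actions from speech is left out.

```python
from typing import List

Pose = List[float]  # hypothetical flat pose vector for one frame


def poses_to_actions(poses: List[Pose]) -> List[Pose]:
    """Turn a pose sequence into actions: actions[t] = poses[t+1] - poses[t]."""
    return [
        [nxt_v - prev_v for prev_v, nxt_v in zip(prev, nxt)]
        for prev, nxt in zip(poses, poses[1:])
    ]


def rollout(initial_pose: Pose, actions: List[Pose]) -> List[Pose]:
    """Recover a trajectory by applying each action to the previous pose."""
    trajectory = [list(initial_pose)]
    for action in actions:
        trajectory.append([p + d for p, d in zip(trajectory[-1], action)])
    return trajectory


# Toy 3-frame sequence with a 2-dimensional pose (illustrative values only).
poses = [[0.0, 1.0], [0.5, 1.5], [1.5, 1.0]]
actions = poses_to_actions(poses)
recovered = rollout(poses[0], actions)
```

Under this assumed formulation, a model that outputs actions rather than absolute poses always produces a trajectory anchored to the previous frame, which is one way to read the abstract's claim about spatially coherent movement.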
