CV · Mar 24, 2022

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

arXiv:2203.13161v1 · 149 citations · h-index: 72
Originality: Incremental advance
AI Analysis

This work addresses the challenge of fine-grained gesture generation in virtual avatar creation, offering an incremental improvement over holistic synthesis approaches.

The paper proposes a hierarchical framework that associates speech audio with human gestures at multiple granularities in order to generate realistic co-speech gestures for virtual avatars. The authors report that it outperforms previous methods.

Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structures of human gestures can be naturally described at multiple granularities and associated together. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer subsequently renders the entire human pose gradually in a hierarchical manner. To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin. Project page: https://alvinliu0.github.io/projects/HA2G
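To make the idea of contrastive learning based on audio-text alignment more concrete, the following is a minimal sketch of a generic InfoNCE-style loss over paired audio and text embeddings. It is an assumption for illustration only: the abstract does not specify the loss used in HA2G, and the names `cosine`, `info_nce`, and `temperature`, as well as the toy embeddings, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio, text, temperature=0.1):
    """Generic InfoNCE loss (audio-to-text direction only).

    audio[i] and text[i] are assumed to be a matching pair; every other
    text[j] in the batch serves as a negative for audio[i].
    This is NOT confirmed to be the loss used in the paper.
    """
    n = len(audio)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio[i], text[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the target.
        total += log_denom - logits[i]
    return total / n


# Toy 2-D embeddings (hypothetical values).
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # text i is close to audio i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = info_nce(audio, aligned_text)
loss_swapped = info_nce(audio, swapped_text)
```

In this sketch the loss is small when each audio embedding is closest to its own text embedding and large when the pairs are mismatched, which is the property a contrastive objective of this kind relies on to shape the audio representation.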

Code Implementations: 1 repo
