CV · Dec 21, 2024

SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

arXiv:2412.16563v3 · 31 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the generation of realistic, semantically rich co-speech gestures for applications such as animation and virtual avatars. It represents an incremental improvement over existing methods.

The paper tackles co-speech motion generation by integrating common rhythmic motions with rare semantic motions. It proposes SemTalk, which learns base motions and sparse motions separately and then fuses them adaptively. The authors report that the method outperforms state-of-the-art approaches on two public datasets and produces motion with enhanced semantic richness.

Good co-speech motion generation cannot be achieved without careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to learn base motions and sparse motions separately, and then adaptively fuse them. In particular, a coarse2fine cross-attention module and rhythmic consistency learning are explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state of the art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
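
The adaptive synthesis step can be illustrated with a small sketch. This is not the paper's implementation: the abstract only states that a learned semantic score guides the integration of sparse motion into base motion. The function name `fuse_motion`, the `threshold` parameter, and the linear blending rule below are all assumptions chosen for illustration, and the scores are supplied by hand rather than learned.

```python
from typing import List, Sequence


def fuse_motion(
    base: Sequence[Sequence[float]],
    sparse: Sequence[Sequence[float]],
    scores: Sequence[float],
    threshold: float = 0.5,
) -> List[List[float]]:
    """Blend sparse motion into base motion frame by frame.

    Hypothetical rule (an assumption, not taken from the paper):
    frames whose semantic score is below `threshold` keep the base
    motion unchanged; other frames are linearly blended toward the
    sparse motion, weighted by the score.
    """
    if not (len(base) == len(sparse) == len(scores)):
        raise ValueError("base, sparse, and scores must have the same number of frames")

    fused: List[List[float]] = []
    for base_frame, sparse_frame, score in zip(base, sparse, scores):
        if len(base_frame) != len(sparse_frame):
            raise ValueError("each base/sparse frame pair must have the same dimension")
        # Gate: ignore low-scoring frames so the rhythmic base stays stable.
        gate = score if score >= threshold else 0.0
        fused.append(
            [(1.0 - gate) * b + gate * s for b, s in zip(base_frame, sparse_frame)]
        )
    return fused


# Three frames of 2-D "motion" features with hand-picked scores.
base_motion = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
sparse_motion = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
semantic_scores = [0.1, 0.5, 1.0]

fused_motion = fuse_motion(base_motion, sparse_motion, semantic_scores)
```

In this toy setup the first frame keeps the base motion, the second is an even blend, and the third takes the sparse motion entirely, which mirrors the idea of emphasizing only the frames that carry semantic cues.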

