CL · May 28

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

arXiv:2605.30608 · 89.42 citations · h-index: 20
AI Analysis

This work improves the retrieval and generation of semantically meaningful co-speech gestures, which matters for applications such as virtual assistants and human-computer interaction.

This paper addresses the challenge of aligning spoken text with semantically meaningful co-speech gestures by proposing semantic motion anchors, which are natural-language abstractions of gesture motion. The method discretizes 3D gestures into body-hand motion primitives, verbalizes them, and grounds them in the transcript, leading to an 8.2% improvement in text-to-gesture R@1 on BEAT2 compared to a direct text-motion baseline.

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors: natural-language abstractions of gesture motion that capture both physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.
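To make the idea of auxiliary contrastive supervision and the R@1 metric concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The function names, the temperature of 0.07, the `aux_weight` value, and the choice to pair the transcript embedding with the anchor embedding in the auxiliary term are all assumptions made for illustration. Embeddings are plain lists of floats, and row i of each input is assumed to belong to the same clip.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of a is the positive for row i of b."""
    sims = similarity_matrix(a, b)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for a stable log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # cross-entropy against the diagonal
        return total / len(rows)

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (one_direction(sims) + one_direction(transposed))


def anchored_loss(text, motion, anchor, aux_weight=0.5):
    """Direct text-motion term plus a weighted text-anchor term (assumed form)."""
    return info_nce(text, motion) + aux_weight * info_nce(text, anchor)


def recall_at_1(queries, gallery):
    """Fraction of queries whose most similar gallery item is the true match."""
    sims = similarity_matrix(queries, gallery)
    hits = 0
    for i, row in enumerate(sims):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sims)
```

In this sketch, setting `aux_weight=0` recovers the direct text-motion baseline the abstract compares against, and `recall_at_1(text_embeddings, motion_embeddings)` corresponds to text-to-gesture R@1, with the arguments swapped for the gesture-to-text direction.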
