CV · Mar 28, 2025

Understanding Co-speech Gestures in-the-wild

Oxford
arXiv:2503.22668v2 · 2 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses the challenge of non-verbal communication analysis for applications in human-computer interaction and social robotics, representing an incremental advance through novel tasks and benchmarks.

The paper tackles the problem of understanding co-speech gestures in unconstrained settings by introducing a new framework with three tasks and benchmarks. It demonstrates that its tri-modal representation approach outperforms previous methods, including large vision-language models, in learning gesture-speech-text associations.

Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-speech-text associations: (i) gesture-based retrieval, (ii) gesture word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal video-gesture-speech-text representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs). Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal.
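To make the idea of a phrase-level contrastive objective concrete, here is a minimal sketch of a generic symmetric InfoNCE loss over paired gesture and phrase embeddings. It is not the paper's formulation: the function names (`l2_normalize`, `info_nce`), the temperature value of 0.07, and the use of plain Python lists are all assumptions made for illustration, and the local gesture-word coupling loss is not sketched here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(gesture_embs, phrase_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is treated as a matched pair.

    Hypothetical helper for illustration only. The temperature is an assumed
    default, not a value taken from the paper.
    """
    g = [l2_normalize(v) for v in gesture_embs]
    p = [l2_normalize(v) for v in phrase_embs]
    n = len(g)

    # Cosine similarity between every gesture and every phrase, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(g[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the gesture-to-phrase and phrase-to-gesture directions.
    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


# Toy usage: correctly paired embeddings should score a lower loss than mismatched ones.
pairs = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(pairs, pairs)
loss_shuffled = info_nce(pairs, pairs[::-1])
```

In this toy batch the aligned pairing yields a loss near zero while the shuffled pairing is heavily penalised, which is the behaviour a contrastive objective relies on to pull matched gesture and phrase embeddings together.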

