HC, LG, MM · Aug 12, 2021

Multimodal analysis of the predictability of hand-gesture properties

arXiv:2108.05762v3 · 27 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating meaningful co-speech gestures for embodied conversational agents. It is an incremental advance, building on existing data-driven approaches to gesture generation.

Using deep learning, the study investigated whether speech text and audio can predict hand-gesture properties (phase, category, and semantics). It found that text features predict meaning-related properties, while audio features better predict rhythm-related properties.

Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, while rhythm-related gesture properties (phase) can be predicted better from audio features than from text. These results are encouraging, as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.

