RO · AI · Apr 13

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

arXiv:2604.11417 · 43.91 citations · h-index: 15
Predicted impact: top 52% in RO · last 90 days · Originality: Incremental advance
AI Analysis

The paper addresses the need for semantic gesture generation in human-robot interaction and offers a computationally efficient alternative to larger models.

The paper proposes a lightweight transformer that predicts iconic co-speech gestures from text and emotion. On the BEAT2 dataset it outperforms GPT-4o in both gesture placement classification and intensity regression, while remaining suitable for real-time deployment.

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

