CV | HC | RO | Jun 25, 2025

How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?

arXiv:2506.20795v1 | 1 citation | h-index: 19 | RO-MAN
Originality: Incremental advance
AI Analysis

The paper addresses the problem of reducing system complexity in gesture recognition for human-robot communication, particularly in noisy industrial environments. The advance is incremental because it builds on existing models.

The study tackled gesture recognition for human-robot interaction by comparing foundation models (V-JEPA and Gemini Flash 2.0) with a skeleton-based approach (HD-GCN) on a new dataset, finding that HD-GCN performed best but V-JEPA came close with a simple adaptation, while Gemini struggled in zero-shot settings.

Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs), with their strong generalization abilities, offer the potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, suggesting a possible path towards reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.
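The abstract does not specify what the "simple, task-specific classification head" looks like. As a rough illustration of the general idea only, the sketch below trains a linear softmax classifier on mean-pooled features from a frozen backbone. Everything here is an assumption made for illustration: the linear head, the mean pooling, the hyperparameters, and the synthetic features produced by the hypothetical `make_synthetic_clips` helper, which stands in for real backbone outputs. No V-JEPA code or API is used.

```python
import math
import random


def mean_pool(tokens):
    """Average the token vectors of one clip into a single feature vector."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearHead:
    """Trainable linear classifier on top of frozen, precomputed features."""

    def __init__(self, dim, num_classes, rng):
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(dim)]
                  for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def logits(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.W, self.b)]

    def predict(self, x):
        z = self.logits(x)
        return max(range(len(z)), key=z.__getitem__)

    def sgd_step(self, x, y, lr):
        """One cross-entropy gradient step; only the head's weights change."""
        p = softmax(self.logits(x))
        for c in range(len(self.W)):
            g = p[c] - (1.0 if c == y else 0.0)
            self.b[c] -= lr * g
            row = self.W[c]
            for d in range(len(x)):
                row[d] -= lr * g * x[d]


def make_synthetic_clips(rng, num_classes=3, clips_per_class=20, tokens=4, dim=8):
    """Hypothetical stand-in for backbone output: token vectors clustered per class."""
    centers = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(num_classes)]
    data = []
    for label, center in enumerate(centers):
        for _ in range(clips_per_class):
            clip = [[c + rng.gauss(0.0, 0.1) for c in center]
                    for _ in range(tokens)]
            data.append((clip, label))
    return data


def train_head(data, dim, num_classes, epochs=50, lr=0.1, seed=0):
    """Fit the head on pooled features; the features themselves are never updated."""
    rng = random.Random(seed)
    head = LinearHead(dim, num_classes, rng)
    feats = [(mean_pool(clip), y) for clip, y in data]
    for _ in range(epochs):
        rng.shuffle(feats)
        for x, y in feats:
            head.sgd_step(x, y, lr)
    return head


rng = random.Random(0)
clips = make_synthetic_clips(rng)
head = train_head(clips, dim=8, num_classes=3)
accuracy = sum(head.predict(mean_pool(c)) == y for c, y in clips) / len(clips)
```

The point of the design is that only the head is trained, so the same frozen backbone could in principle serve several tasks, each with its own small head. The accuracy computed here is on synthetic training data and says nothing about the results reported in the paper.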

