HCROMay 18

FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech

arXiv:2503.1649269.910 citationsh-index: 10Has Code
Predicted impact top 3% in HC · last 90 daysOriginality Incremental advance
AI Analysis

This work addresses the need for more intuitive and accessible HRI for users with motor impairments by combining gaze and language modalities.

FAM-HRI integrates gaze and speech inputs via foundation models for efficient human-robot interaction, achieving high task success rates with low interaction time, particularly benefiting users with physical impairments.

ffective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gesture- only or language-only commands, making interaction inefficient and ambiguous, particularly for users with physical impairments. In this paper, we introduce FAM-HRI, an efficient multimodal framework for HRI that integrates language and gaze inputs via foundation models. By leveraging lightweight Meta ARIA glasses, our system captures real-time multimodal signals and utilizes large language models (LLMs) to fuse user intention with scene context, enabling intuitive and precise robot manipulation. Our method accurately determines the gaze fixation time interval, reducing noise caused by the gaze dynamic nature. Experimental evaluations demonstrate that FAM-HRI achieves a high success rate in task execution while maintaining a low interaction time, providing a practical solution for individuals with limited physical mobility or motor impairments. To support the community, we have released our system design, algorithms, and solutions at https://github.com/laiyuzhi/FAM-HRI.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes