RO | AI | HC | Apr 12, 2024

Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

arXiv:2404.08424v3 | 10 citations | h-index: 14 | ICSR + AI
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of enabling natural human-robot interaction through intention prediction. The contribution is incremental: it applies existing LLMs to a new multimodal context.

The paper tackled the problem of predicting human intentions in collaborative object categorization tasks with social robots by proposing a multimodal approach integrating verbal and non-verbal cues. The evaluation of five LLMs demonstrated their potential for reasoning about these cues to support intention prediction, though no concrete performance numbers were provided.

Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with a social robot. Video: https://youtu.be/tBJHfAuzohI
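
As a rough illustration of how the cue types named in the abstract could be combined into a single LLM query, here is a minimal sketch. It is not the authors' implementation: the `Observation` class, its field names, the two helper functions, and the prompt wording are all hypothetical, and the two-step structure only loosely mirrors the "hierarchical" idea (non-verbal cues are summarized first, then merged with the verbal cue and the environment state). No LLM is called; the code only builds the prompt text.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One snapshot of the interaction (all field names are hypothetical)."""
    utterance: str
    gesture: str
    pose: str
    facial_expression: str
    objects_on_table: list[str] = field(default_factory=list)


def describe_nonverbal(obs: Observation) -> str:
    # Lower level: turn the perceived non-verbal cues into one text line.
    return (f"gesture={obs.gesture}; pose={obs.pose}; "
            f"facial_expression={obs.facial_expression}")


def build_intention_prompt(obs: Observation) -> str:
    # Upper level: merge the non-verbal summary with the verbal cue and
    # the environment state into one prompt (wording is an assumption).
    lines = [
        "Task: collaborative object categorization with a robot.",
        f"Environment: {', '.join(obs.objects_on_table) or 'empty'}",
        f'User said: "{obs.utterance}"',
        f"Non-verbal cues: {describe_nonverbal(obs)}",
        "Question: what does the user intend the robot to do next?",
    ]
    return "\n".join(lines)


obs = Observation(
    utterance="Put that one with the fruit",
    gesture="pointing at apple",
    pose="leaning forward",
    facial_expression="neutral",
    objects_on_table=["apple", "orange", "cup"],
)
prompt = build_intention_prompt(obs)
print(prompt)
```

The resulting string could be sent to any chat-style LLM; which models were used and how they were prompted is described in the paper itself, not here.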

