AI · Dec 2, 2024

Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, Yiming Wang

arXiv:2412.01250v3 · 12.510 citations · h-index: 14

Originality: Incremental advance

AI Analysis

The paper addresses the burden placed on humans who must give embodied agents a detailed target description before navigation begins. The advance appears incremental: it builds on existing language-driven methods and adds a focus on interaction.

The paper aims to reduce human effort in language-driven instance object navigation. It introduces a collaborative task setting in which the agent actively resolves uncertainties through dialogue, and it proposes a training-free method that minimizes user input while achieving competitive performance on a new benchmark for complex multi-instance scenes.

Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation through natural, template-free, open-ended dialogues with the human. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently of the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate description of the observation, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask the human a question, continue navigating, or halt, minimizing user input. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes. Code and benchmark will be available upon acceptance at https://intelligolabs.github.io/CoIN/
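
The ask / continue / halt decision described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how uncertainty is estimated or how the trigger reasons, so the `Attribute` type, the `uncertainty` score, the `max_uncertainty` threshold, and the rule-based logic below are all hypothetical stand-ins for what the paper does with VLMs and LLMs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ASK = "ask"            # pose a question to the human
    CONTINUE = "continue"  # keep navigating; this object is not accepted
    STOP = "stop"          # halt; the target is believed to be found


@dataclass
class Attribute:
    name: str
    value: str
    uncertainty: float  # hypothetical score in [0, 1]; lower means more certain


def interaction_trigger(observed, target_facts, max_uncertainty=0.3):
    """Toy decision rule (an assumption, not the paper's method).

    observed: attributes of the detected object, each with an uncertainty score.
    target_facts: dict of attribute name -> value already learned from the human.
    Returns (Action, question or None).
    """
    # Discard attributes the (hypothetical) uncertainty estimate does not trust.
    certain = [a for a in observed if a.uncertainty <= max_uncertainty]

    # A confident attribute that contradicts a known fact rules the object out.
    for a in certain:
        if a.name in target_facts and target_facts[a.name] != a.value:
            return Action.CONTINUE, None

    # Confident attributes with no known fact are candidates for a question.
    unknown = [a for a in certain if a.name not in target_facts]
    if unknown:
        best = min(unknown, key=lambda a: a.uncertainty)
        return Action.ASK, f"Is the {best.name} of the target {best.value}?"

    # Everything confident agrees with the known facts: accept the object.
    if any(a.name in target_facts for a in certain):
        return Action.STOP, None

    # No reliable evidence either way: keep exploring.
    return Action.CONTINUE, None
```

In this sketch, an object observed as confidently red is skipped when the human has already said the target is blue, triggers a question when the color is still unknown, and is accepted when the color is known to be red. A single question per detection is one simple way to keep user input low; the actual trade-off in AIUTA may differ.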
