CV HC RO Oct 9, 2025

A Multimodal Depth-Aware Method For Embodied Reference Understanding

arXiv:2510.08278v2
h-index: 33
Originality: Incremental advance
AI Analysis

This work addresses embodied reference understanding for robotics and human-computer interaction, but it appears incremental because it builds on prior methods and mainly adds depth integration.

The paper tackles ambiguous object identification in visual scenes from language and pointing cues. It proposes a multimodal depth-aware framework that the authors report significantly outperforms existing baselines on two datasets, giving more accurate referent detection.

Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.

