CV HC RO
Oct 9, 2025

A Multimodal Depth-Aware Method For Embodied Reference Understanding

Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

arXiv:2510.08278v2
h-index: 33

Originality: Incremental advance

AI Analysis

This work addresses embodied reference understanding for robotics or human-computer interaction, but it appears incremental because it builds on prior methods and adds depth integration.

The paper tackles ambiguous object identification in visual scenes from language and pointing cues by proposing a multimodal depth-aware framework. The authors report that it significantly outperforms existing baselines on two datasets, giving more accurate referent detection.

Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
