CV · Nov 13, 2024

AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

arXiv:2411.08451v1 · 1 citation · h-index: 11 · IEEE Transactions on Circuits and Systems for Video Technology (Print)
Originality: Incremental advance
AI Analysis

The paper addresses a key challenge for intelligent agents, namely interpreting human gestures together with language, and offers an incremental accuracy improvement along with a novel distance-aware mechanism.

The paper tackles misinterpretation in embodied reference understanding by introducing a distance-aware framework that predicts the target object and the attention source from pointing gestures, achieving 76.4% accuracy at the 0.25 IoU threshold and surpassing human performance at the 0.75 IoU threshold.

Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line that represents referring gestures based on interactive distances. The combination of this distance-aware approach and independent prediction of the attention source enhances the alignment between objects and the line representing the gesture. Extensive experiments on the YouRefIt dataset demonstrate that our gesture understanding method significantly improves task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.
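
The abstract reports accuracy at IoU thresholds (0.25 and 0.75). As a reading aid, the following minimal sketch shows how such a metric is commonly computed for box predictions: a prediction counts as correct when its intersection-over-union with the ground-truth box reaches the threshold. The function names (`iou`, `accuracy_at_iou`), the `(x_min, y_min, x_max, y_max)` box format, and the inclusive `>=` comparison are assumptions made for illustration; this is not the paper's evaluation code, and the boxes below are made-up toy values.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float
) -> float:
    """Fraction of predictions whose IoU with the target is >= threshold.

    The inclusive comparison is an assumption; some benchmarks use a
    strict inequality instead.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy data, not taken from YouRefIt.
    preds = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
    targets = [(0, 0, 10, 10), (0, 0, 10, 6), (5, 5, 15, 15)]
    for thr in (0.25, 0.5, 0.75):
        print(f"Acc@{thr}: {accuracy_at_iou(preds, targets, thr):.3f}")
```

A stricter threshold such as 0.75 demands much tighter localization than 0.25, which is why results at the two thresholds are reported separately.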
