Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding
This work targets more natural human-robot interaction by moving beyond simple object descriptions, though the contribution is incremental: it extends existing REC tasks with new attributes.
The paper addresses a limitation of existing referring expression comprehension methods, which rely on object categories and single attributes. It proposes a multi-attribute framework that integrates state, intention, and gestures, and reports improved localization performance on the new SIGAR dataset.
Referring expression comprehension (REC) aims to localize objects from natural language descriptions. However, existing REC approaches are limited to object category descriptions and single-attribute intention descriptions, which hinders their application in real-world scenarios. In natural human-robot interactions, users often express their desires through their own states and intentions, accompanied by guiding gestures, rather than through detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references improve localization performance, indicating that a single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing vision-language understanding.