CV
Sep 8, 2021

YouRefIt: Embodied Reference Understanding with Language and Gesture

arXiv:2109.03413v2
59 citations
AI Analysis

This work aims to improve human-robot interaction and the understanding of referential behavior in daily physical scenes. It is incremental in that it builds on existing multimodal and referring-expression research.

The authors address embodied reference understanding, in which one agent uses language and gesture to refer to an object for another agent in a shared physical environment. They introduce the YouRefIt dataset, with 4,195 unique reference clips in 432 indoor scenes, and establish benchmarks for image-based and video-based tasks. Their results indicate that gestural cues are as critical as language cues, which the authors describe as the first machine-perception evidence of this effect.

We study the understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment. Of note, this new visual task requires understanding multimodal cues with perspective-taking to identify which object is being referred to. To tackle this problem, we introduce YouRefIt, a new crowd-sourced dataset of embodied reference collected in various physical scenes; the dataset contains 4,195 unique reference clips in 432 indoor scenes. To the best of our knowledge, this is the first embodied reference dataset that allows us to study referring expressions in daily physical scenes to understand referential behavior, human communication, and human-robot interaction. We further devise two benchmarks for image-based and video-based embodied reference understanding. Comprehensive baselines and extensive experiments provide the very first result of machine perception on how the referring expressions and gestures affect the embodied reference understanding. Our results provide essential evidence that gestural cues are as critical as language cues in understanding the embodied reference.

