CL AI CV RO · Nov 12, 2023

Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding

Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

arXiv:2311.06694v39.933 citations · h-index: 10 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses language grounding for embodied AI systems, representing an incremental improvement over existing methods.

The paper tackles the problem of grounding language to objects in 3D environments by leveraging comparative context between objects and multiple camera views, resulting in a 12.9% relative error reduction (2.7% absolute improvement) over the state-of-the-art on the SNARE task.
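The two reported figures are linked by simple arithmetic: a relative error reduction equals the absolute gain divided by the baseline error rate. The sketch below inverts that relation to estimate the error rates the two numbers imply. The function name is hypothetical, and because both inputs are rounded in the source, the results are approximations and not figures taken from the paper.

```python
def implied_baseline_error(relative_reduction: float, absolute_gain: float) -> float:
    """Baseline error implied by a relative error reduction and an absolute gain.

    relative_reduction = absolute_gain / baseline_error, therefore
    baseline_error = absolute_gain / relative_reduction.
    """
    if relative_reduction <= 0:
        raise ValueError("relative_reduction must be positive")
    return absolute_gain / relative_reduction


# Reported values: 12.9% relative error reduction, 2.7 points absolute improvement.
baseline_error = implied_baseline_error(relative_reduction=0.129, absolute_gain=2.7)
new_error = baseline_error - 2.7  # error rate after the improvement, in percent

print(f"implied baseline error: {baseline_error:.1f}%")
print(f"implied new error:      {new_error:.1f}%")
```

Under these rounded inputs, the baseline error comes out at roughly 21% and the improved error at roughly 18%.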

When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object's appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent based on language that distinguishes between two similar objects. By pragmatically reasoning over both objects and across multiple views of those objects, MAGiC improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9% (representing an absolute improvement of 2.7%). Ablation studies show that reasoning jointly over object referent candidates and multiple views of each object both contribute to improved accuracy. Code: https://github.com/rcorona/magic_snare/
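To make the task setup concrete, here is a toy sketch of choosing between two candidate objects, each seen from several views, given one description. It is not the MAGiC architecture: it assumes pre-computed embedding vectors, scores each view by cosine similarity, averages over views, and compares the two candidates with a softmax. All function names and example vectors are hypothetical. A learned model that reasons jointly over both objects and all views, as the abstract describes, would replace the independent averaging used here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def object_score(text_vec, view_vecs):
    """Average text-to-view similarity across all views of one object."""
    if not view_vecs:
        raise ValueError("each object needs at least one view")
    return sum(cosine(text_vec, v) for v in view_vecs) / len(view_vecs)


def pick_referent(text_vec, views_a, views_b):
    """Return (index of chosen object, probability assigned to object A)."""
    score_a = object_score(text_vec, views_a)
    score_b = object_score(text_vec, views_b)
    # Softmax over the two candidates, so the decision is comparative.
    top = max(score_a, score_b)
    exp_a = math.exp(score_a - top)
    exp_b = math.exp(score_b - top)
    prob_a = exp_a / (exp_a + exp_b)
    return (0 if prob_a >= 0.5 else 1), prob_a


# Made-up 2-D embeddings: the description is closer to object A's views.
description = [1.0, 0.0]
object_a_views = [[0.9, 0.1], [0.8, 0.3]]
object_b_views = [[0.1, 0.9], [0.2, 0.8]]

choice, prob_a = pick_referent(description, object_a_views, object_b_views)
print("chosen object:", "A" if choice == 0 else "B")
```

Averaging over views is the simplest way to reduce sensitivity to camera position. The softmax step only compares two independently computed scores, so it captures far less comparative context than a model that attends over both objects at once.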
