RO CV

Nov 15, 2021

Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement

Walter Goodwin, Sagar Vaze, Ioannis Havoutis, Ingmar Posner

arXiv:2111.07975v1

17.945 citations

Has Code

Originality: Incremental advance

AI Analysis

This work addresses a key challenge in robotic scene rearrangement: it enables more general and robust object matching without instance-specific training data. The advance is incremental, since it builds on existing vision-language models.

The paper tackles the problem of matching objects between a robot's scene and a goal image when the two share no object instances. It uses a vision-language model to combine semantics with visual features as a measure of similarity, which considerably improves matching performance for cross-instance robotic rearrangement.

Object rearrangement has recently emerged as a key competency in robot manipulation, with practical solutions generally involving object detection, recognition, grasping and high-level planning. Goal-images describing a desired scene configuration are a promising and increasingly used mode of instruction. A key outstanding challenge is the accurate inference of matches between objects in front of a robot, and those seen in a provided goal image, where recent works have struggled in the absence of object-specific training data. In this work, we explore the deterioration of existing methods' ability to infer matches between objects as the visual shift between observed and goal scenes increases. We find that a fundamental limitation of the current setting is that source and target images must contain the same $\textit{instance}$ of every object, which restricts practical deployment. We present a novel approach to object matching that uses a large pre-trained vision-language model to match objects in a cross-instance setting by leveraging semantics together with visual features as a more robust, and much more general, measure of similarity. We demonstrate that this provides considerably improved matching performance in cross-instance settings, and can be used to guide multi-object rearrangement with a robot manipulator from an image that shares no object $\textit{instances}$ with the robot's scene.
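To make the cross-instance matching idea concrete, here is a minimal sketch and not the authors' implementation. It assumes each object already has a visual embedding and a semantic embedding (in practice these would come from a vision-language model; toy vectors are used here). The weighted-sum combination, the `alpha` parameter, the function names, and the brute-force one-to-one assignment are all illustrative assumptions that the abstract does not specify.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(src, tgt, alpha=0.5):
    """Blend visual and semantic similarity (assumed weighted sum)."""
    visual = cosine(src["visual"], tgt["visual"])
    semantic = cosine(src["semantic"], tgt["semantic"])
    return alpha * visual + (1.0 - alpha) * semantic


def match_objects(sources, targets, alpha=0.5):
    """Return match[i] = index of the target assigned to source i.

    Brute-force search over one-to-one assignments; only practical
    for the handful of objects in a tabletop scene.
    """
    if len(sources) != len(targets):
        raise ValueError("sketch assumes equal numbers of objects")
    n = len(sources)
    scores = [[similarity(s, t, alpha) for t in targets] for s in sources]
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return list(best)


# Toy scene: a mug and a spoon in front of the robot, and a goal image
# showing a *different* spoon and mug (hypothetical embeddings).
scene = [
    {"visual": [0.9, 0.1], "semantic": [1.0, 0.0]},  # mug
    {"visual": [0.2, 0.8], "semantic": [0.0, 1.0]},  # spoon
]
goal = [
    {"visual": [0.4, 0.6], "semantic": [0.0, 1.0]},  # other spoon
    {"visual": [0.6, 0.4], "semantic": [1.0, 0.0]},  # other mug
]
matches = match_objects(scene, goal)
```

In this toy case the visual features of the two goal objects are fairly close to each other, so the semantic term is what separates the mug-to-mug pairing from the mug-to-spoon one. For larger scenes, an optimal assignment solver would replace the brute-force search.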
