RO | AI | CV | LG | Apr 20, 2024

Composing Pre-Trained Object-Centric Representations for Robotics From "What" and "Where" Foundation Models

arXiv:2404.13474v1 | 22 citations | h-index: 15 | ICRA
Originality: Incremental advance
AI Analysis

This work addresses the challenge of leveraging pre-trained vision models for robotics, offering a practical solution for improved imitation learning in manipulators.

The paper builds pre-trained object-centric representations for robotic control by combining off-the-shelf segmentation and description models, with no new training. Imitation policies trained on these representations show better performance and systematic generalization than state-of-the-art methods on simulated and real robotic tasks.

There have recently been large advances both in pre-training visual representations for robotic control and in segmenting objects of unknown category in general images. To leverage these for improved robot learning, we propose $\textbf{POCR}$, a new framework for building pre-trained object-centric representations for robotic control. Building on theories of "what-where" representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate various entities in the scene across timesteps, capturing "where" information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, thus capturing "what" the entity is. Our pre-trained object-centric representations for control are therefore constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior object-centric representations that are typically trained from scratch.
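
The "what" and "where" composition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the per-entity masks are already available from some segmentation model, in an order that stays consistent across timesteps. The names `toy_what_encoder`, `where_descriptor`, and `compose_object_centric` are hypothetical, and the "what" encoder is a trivial stand-in (mean colour) for a real pre-trained visual encoder.

```python
import numpy as np

def toy_what_encoder(image):
    """Hypothetical stand-in for a pre-trained encoder: mean value per channel."""
    return image.reshape(-1, image.shape[-1]).mean(axis=0)

def where_descriptor(mask):
    """Assumed 'where' summary: normalised centroid (row, col) and area fraction."""
    h, w = mask.shape
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        # Empty mask: the entity is not visible in this frame.
        return np.zeros(3)
    return np.array([rows.mean() / h, cols.mean() / w, rows.size / (h * w)])

def compose_object_centric(image, masks, what_encoder=toy_what_encoder):
    """Build one [what | where] vector per mask; no parameters are trained."""
    slots = []
    for mask in masks:
        masked = image * mask[..., None]  # zero out pixels outside the entity
        what = what_encoder(masked)       # "what": description of the entity
        where = where_descriptor(mask)    # "where": location from the mask
        slots.append(np.concatenate([what, where]))
    return np.stack(slots)

# Toy usage: a 4x4 RGB image of ones, split into a left and a right entity.
image = np.ones((4, 4, 3))
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True
right = ~left
representation = compose_object_centric(image, [left, right])
print(representation.shape)  # one row per entity
```

In this sketch a downstream imitation policy would consume the stacked per-entity vectors. Swapping `toy_what_encoder` for a frozen pre-trained encoder keeps the same structure.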

