RO · CV · May 27, 2025

Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

arXiv:2505.20962v1 · h-index: 15 · ICDL
Originality: Incremental advance
AI Analysis

This work aims to reduce reliance on annotated, robot-specific datasets, offering a method to accelerate training and improve generalizability. The advance is incremental, since it builds on existing techniques such as Slot Attention and pretrained models.

The paper tackles the problem of learning visual representations from human action videos to improve robot visuo-motor policy learning. It demonstrates that an object-centric encoder integrating semantic segmentation and representation generation enhances reinforcement and imitation learning in simulated tasks, and that fine-tuning on out-of-domain human action data significantly boosts performance.

Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by psychological theories suggesting that humans process scenes in an object-based fashion, we propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner, unlike other works, which treat these as separate processes. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets, to bootstrap fine-tuning on human action video data. Through simulated robotic tasks, we demonstrate that visual representations can enhance reinforcement and imitation learning training, highlighting the effectiveness of our integrated approach for semantic segmentation and encoding. Furthermore, we show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions, although still out-of-domain, can significantly improve performance due to close alignment with robotic tasks. These findings show the capability to reduce reliance on annotated or robot-specific action datasets and the potential to build on existing visual encoders to accelerate training and improve generalizability.
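To illustrate how a single attention step can yield both a soft segmentation and a set of object representations, here is a minimal NumPy sketch of a Slot Attention style update. It is not the paper's encoder or the SOLV model: the function name `slot_attention_sketch`, the fixed random projections standing in for learned layers, and all parameter values are assumptions made for illustration, and the GRU and MLP refinement used in the original Slot Attention formulation is omitted.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_sketch(features, num_slots=4, iters=3, seed=0, eps=1e-8):
    """Simplified Slot Attention style update (illustrative only).

    features: array of shape (N, D), e.g. N image patch features.
    Returns (slots, attn) with shapes (num_slots, D) and (N, num_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    # Hypothetical fixed random projections standing in for learned q/k/v layers.
    wq, wk, wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    slots = rng.normal(size=(num_slots, d))  # random slot initialisation
    k = features @ wk
    v = features @ wv
    for _ in range(iters):
        q = slots @ wq
        logits = (k @ q.T) / np.sqrt(d)  # (N, K) feature-to-slot affinities
        # Softmax over the slot axis: slots compete for each input feature.
        attn = softmax(logits, axis=1)
        # Normalise over inputs so each slot takes a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        # Direct replacement of the slots; the original method refines them
        # with a GRU and an MLP instead.
        slots = weights.T @ v  # (K, D)
    return slots, attn
```

In this sketch, `attn.argmax(axis=1)` assigns each input feature to one slot, which acts as a rough segmentation, while the rows of `slots` serve as per-object representation vectors. Both come from the same attention weights, which is the sense in which segmentation and encoding are coupled here.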
