RO · CV · Mar 15

OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer

arXiv:2603.14401 · 91.72 citations · h-index: 25
Predicted impact: top 9% in RO · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the challenge of learning robust robot manipulation from human videos. It is rated an incremental advance because it builds on existing object-centric and multimodal methods.

The paper tackles the problem of transferring human actions to robots from demonstration videos by introducing OCRA, an object-centric framework that uses 3D and tactile priors, resulting in significant performance improvements over baselines in vision-only and visuo-tactile tasks.

We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
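
The abstract names the fusion module (ResFiLM) but does not define it. The sketch below shows one plausible reading, assuming it is FiLM-style feature modulation with a residual connection: the tactile feature predicts a per-channel scale and shift applied to the 3D visual feature, and the result is added back to the visual feature. The function names, the linear projections, and the feature sizes are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per output channel


def linear(weights: Matrix, x: Vector) -> Vector:
    """Bias-free linear projection: one dot product per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def res_film(visual: Vector, tactile: Vector,
             w_gamma: Matrix, w_beta: Matrix) -> Vector:
    """Residual FiLM (assumed form): visual + (gamma * visual + beta).

    gamma and beta are predicted from the tactile feature, so touch
    modulates the visual channels instead of being concatenated to them.
    """
    gamma = linear(w_gamma, tactile)  # per-channel scale
    beta = linear(w_beta, tactile)    # per-channel shift
    return [v + g * v + b for v, g, b in zip(visual, gamma, beta)]


# Hypothetical 3-channel visual feature and 2-channel tactile feature.
visual = [1.0, -2.0, 0.5]
tactile = [0.5, 1.0]

# With zero projections the module is the identity on the visual feature.
zeros = [[0.0, 0.0] for _ in visual]
identity_out = res_film(visual, tactile, zeros, zeros)

# Non-zero projections let the tactile signal rescale and shift channels.
w_gamma = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_beta = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
fused = res_film(visual, tactile, w_gamma, w_beta)
```

Under this assumed form, zero-initialized projections make the module start as a pass-through for the visual feature, which is a common motivation for adding the residual path. In a pipeline like the one described, the fused vector would then condition the Diffusion Policy.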

