RO CV LG · Feb 5, 2023

Multi-View Masked World Models for Visual Robotic Manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel

arXiv:2302.02408v229.7105 citations · h-index: 164 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses how to exploit multi-view camera data for visual robotic manipulation, proposing a method that supports policy training under viewpoint randomization and transfer to real robots.

The paper studies representation learning from multi-view visual data for robotic manipulation. It trains a multi-view masked autoencoder that reconstructs pixels of randomly masked viewpoints, then learns a world model on the autoencoder's representations. The resulting representations support policy training under strong viewpoint randomization and transfer to real-robot tasks without camera calibration or an adaptation procedure.

Visual robotic manipulation research and applications often use multiple cameras, or views, to better perceive the world. How else can we utilize the richness of multi-view data? In this paper, we investigate how to learn good representations with multi-view data and utilize them for visual robotic manipulation. Specifically, we train a multi-view masked autoencoder which reconstructs pixels of randomly masked viewpoints and then learn a world model operating on the representations from the autoencoder. We demonstrate the effectiveness of our method in a range of scenarios, including multi-view control and single-view control with auxiliary cameras for representation learning. We also show that the multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy with strong viewpoint randomization and transferring the policy to solve real-robot tasks without camera calibration and an adaptation procedure. Video demonstrations are available at: https://sites.google.com/view/mv-mwm.
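The view-masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`choose_masked_views`, `split_views`, `masked_view_mse`), the flat pixel lists, and the choice to score only the masked views are all assumptions made for illustration, and a zero-valued stand-in replaces the autoencoder.

```python
import random


def choose_masked_views(num_views, num_masked, rng):
    """Pick which camera views to hide (hypothetical helper)."""
    if not 1 <= num_masked < num_views:
        raise ValueError("mask at least one view and keep at least one visible")
    return sorted(rng.sample(range(num_views), num_masked))


def split_views(views, masked_idx):
    """Separate visible inputs from the hidden views to be reconstructed."""
    masked = set(masked_idx)
    visible = {i: v for i, v in enumerate(views) if i not in masked}
    targets = {i: views[i] for i in masked_idx}
    return visible, targets


def masked_view_mse(predictions, targets):
    """Mean squared pixel error over the masked views only (an assumption)."""
    total, count = 0.0, 0
    for i, target in targets.items():
        for p, t in zip(predictions[i], target):
            total += (p - t) ** 2
            count += 1
    return total / count


# Three toy "camera views", each a flat list of pixel intensities.
views = [[0.0, 0.1], [0.5, 0.6], [0.9, 1.0]]
rng = random.Random(0)

masked_idx = choose_masked_views(len(views), 1, rng)
visible, targets = split_views(views, masked_idx)

# Stand-in for the autoencoder: predicts zeros for every hidden pixel.
predictions = {i: [0.0] * len(t) for i, t in targets.items()}
loss = masked_view_mse(predictions, targets)
```

In a real system the stand-in predictor would be an encoder-decoder that sees only `visible` and is trained to drive `loss` down, so that its representations carry information shared across viewpoints.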
