RO | CV | LG | Oct 6, 2022

Real-World Robot Learning with Masked Visual Pre-training

arXiv:2210.03109v1 | 342 citations | h-index: 164
Originality: Incremental advance
AI Analysis

This work addresses the challenge of developing effective visual representations for real-world robotic applications, though it is incremental as it builds on existing masked autoencoder methods.

The paper aims to improve robot learning through self-supervised visual pre-training on diverse, real-world images. It reports that its masked autoencoder-based encoder outperforms CLIP by up to 75%, supervised ImageNet pre-training by up to 81%, and training from scratch by up to 81% across a range of real-world robotic tasks.

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic tasks and embodiments. We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%). Finally, we train a 307M parameter vision transformer on a massive collection of 4.5M images from the Internet and egocentric videos, and demonstrate clearly the benefits of scaling visual pre-training for robot learning.
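The "frozen encoder, learnable control module" pattern described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoder here is a fixed random projection standing in for a pre-trained MAE vision transformer, the control module is a plain linear map, and training it by regressing demonstrated actions is an assumption made for illustration. All names (`make_frozen_encoder`, `LinearPolicy`, `train`) are hypothetical.

```python
import math
import random


def make_frozen_encoder(in_dim, feat_dim, seed=0):
    """Stand-in for a pre-trained visual encoder (assumption: a fixed
    random projection followed by tanh, NOT a real MAE ViT)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
               for _ in range(feat_dim)]

    def encode(image):
        # Pure forward pass; nothing in this sketch ever writes to `weights`.
        return [math.tanh(sum(w * x for w, x in zip(row, image)))
                for row in weights]

    return encode, weights


class LinearPolicy:
    """Learnable control module: maps visual features to an action vector."""

    def __init__(self, feat_dim, act_dim):
        self.w = [[0.0] * feat_dim for _ in range(act_dim)]

    def act(self, features):
        return [sum(w * f for w, f in zip(row, features)) for row in self.w]

    def update(self, features, target, lr):
        # One SGD step on 0.5 * squared error; returns the pre-update MSE.
        errors = [p - t for p, t in zip(self.act(features), target)]
        for row, err in zip(self.w, errors):
            for j, f in enumerate(features):
                row[j] -= lr * err * f
        return sum(e * e for e in errors) / len(errors)


def train(encode, policy, demos, epochs=200, lr=0.05):
    """Fit only the policy on (image, action) pairs; the encoder stays fixed."""
    losses = []
    for _ in range(epochs):
        total = 0.0
        for image, action in demos:
            features = encode(image)  # frozen: no update reaches the encoder
            total += policy.update(features, action, lr)
        losses.append(total / len(demos))
    return losses


# Synthetic "demonstrations": 4 flattened 16-pixel images with 2-D actions.
rng = random.Random(1)
demos = [([rng.random() for _ in range(16)],
          [rng.uniform(-1.0, 1.0) for _ in range(2)]) for _ in range(4)]

encode, enc_w = make_frozen_encoder(in_dim=16, feat_dim=8)
snapshot = [row[:] for row in enc_w]  # copy to check the encoder never changes
policy = LinearPolicy(feat_dim=8, act_dim=2)
losses = train(encode, policy, demos)
print("first-epoch loss:", losses[0], "last-epoch loss:", losses[-1])
```

The design point the sketch isolates is that only the small control module receives updates, so the visual representation can be pre-trained once and reused unchanged across tasks.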

Code Implementations: 1 repo
