RO · AI · CL · CV · LG · Sep 12, 2022

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

arXiv:2209.05451v2 · 783 citations · h-index: 133
AI Analysis

This work addresses data-efficient multi-task robotic manipulation and is aimed at researchers and practitioners. It is an incremental advance: it adapts Transformer architectures to a domain-specific bottleneck, namely that manipulation data is limited and expensive.

The paper tackles the challenge of applying Transformers to robotic manipulation with limited data. It introduces PerAct, a language-conditioned behavior-cloning agent that encodes 3D voxel observations and outputs discretized actions. A single multi-task model trained on 18 RLBench tasks and 7 real-world tasks significantly outperforms image-to-action agents and 3D ConvNet baselines.

Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
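
The voxelized observation and action space described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`voxelize`, `next_best_voxel`), the workspace bounds, the grid size, and the 4-channel feature layout (mean RGB plus occupancy) are all assumptions made for illustration, and the per-voxel scores stand in for whatever the Transformer would predict. The sketch covers only the translation part of an action: an RGB point cloud is scattered into a dense grid, and the highest-scoring voxel is decoded back into a 3D position.

```python
import numpy as np


def voxelize(points, colors, bounds_min, bounds_max, grid_size):
    """Scatter an RGB point cloud into a (G, G, G, 4) grid: mean RGB + occupancy.

    Hypothetical helper for illustration; the channel layout is an assumption.
    """
    points = np.asarray(points, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    bounds_min = np.asarray(bounds_min, dtype=np.float64)
    bounds_max = np.asarray(bounds_max, dtype=np.float64)

    voxel_size = (bounds_max - bounds_min) / grid_size
    idx = np.floor((points - bounds_min) / voxel_size).astype(int)

    # Drop points that fall outside the workspace bounds.
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx, colors = idx[inside], colors[inside]

    shape = (grid_size, grid_size, grid_size)
    counts = np.zeros(shape)
    rgb_sum = np.zeros(shape + (3,))
    # np.add.at accumulates correctly when several points share a voxel.
    np.add.at(counts, tuple(idx.T), 1.0)
    np.add.at(rgb_sum, tuple(idx.T), colors)

    occupied = counts > 0
    rgb_mean = np.zeros_like(rgb_sum)
    rgb_mean[occupied] = rgb_sum[occupied] / counts[occupied][:, None]
    return np.concatenate([rgb_mean, occupied[..., None].astype(np.float64)], axis=-1)


def next_best_voxel(scores, bounds_min, bounds_max):
    """Pick the highest-scoring voxel and return its index and metric center.

    `scores` is a (G, G, G) array of per-voxel scores (assumed model output).
    """
    bounds_min = np.asarray(bounds_min, dtype=np.float64)
    bounds_max = np.asarray(bounds_max, dtype=np.float64)
    grid_size = scores.shape[0]

    flat_best = np.argmax(scores)
    voxel_idx = tuple(int(i) for i in np.unravel_index(flat_best, scores.shape))

    voxel_size = (bounds_max - bounds_min) / grid_size
    center = bounds_min + (np.array(voxel_idx) + 0.5) * voxel_size
    return voxel_idx, center
```

Because the action is read off the same grid that holds the observation, choosing a translation reduces to a classification over voxels rather than a regression from pixels. That is one way to read the "strong structural prior" the abstract refers to.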

Code Implementations: 1 repo