PLEX: Making the Most of the Available Data for Robotic Manipulation Pretraining
This work addresses the challenge of data efficiency for robotic manipulation, enabling more generalizable models with less costly demonstration data.
The authors tackle the problem of learning rich representations for robotic manipulation from limited multimodal data. They propose PLEX, a transformer-based architecture that combines a small amount of task-agnostic visuomotor trajectories with a larger set of task-conditioned object manipulation videos. PLEX achieves state-of-the-art performance in challenging Robosuite environments and generalizes well on Meta-World.
A rich representation is key to general robotic manipulation, but existing approaches to representation learning require large amounts of multimodal demonstrations. In this work we propose PLEX, a transformer-based architecture that learns from a small amount of task-agnostic visuomotor trajectories and a much larger amount of task-conditioned object manipulation videos, a type of data that is available in large quantities. PLEX uses the visuomotor trajectories to induce a latent feature space and to learn task-agnostic manipulation routines, while diverse video-only demonstrations teach PLEX how to plan in the induced latent feature space for a wide variety of tasks. Experiments showcase PLEX's generalization on Meta-World and state-of-the-art performance in challenging Robosuite environments. In particular, using relative positional encoding in PLEX's transformers greatly helps in low-data regimes of learning from human-collected demonstrations. The paper's accompanying code and data are available at https://microsoft.github.io/PLEX.
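To make the idea of relative positional encoding concrete, the following is a minimal pure-Python sketch of causal attention weights with an additive bias that depends only on the offset between the query and key positions. It is a generic illustration and not PLEX's actual implementation: the function name `relative_attention_weights`, the `rel_bias` table, and the `max_distance` clipping are all assumptions made for this example, and in a real model the bias values would be learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention_weights(queries, keys, rel_bias, max_distance):
    """Causal attention weights with an additive relative-position bias.

    queries, keys: lists of equal-length float vectors, one per time step.
    rel_bias[d]:   hypothetical scalar bias for a key that lags the query
                   by d steps (learned in practice, fixed here).
    max_distance:  offsets larger than this share the bias rel_bias[max_distance].
    """
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j in range(i + 1):  # causal mask: attend only to steps j <= i
            content = sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            offset = min(i - j, max_distance)  # depends on i - j, not on i or j alone
            scores.append(content + rel_bias[offset])
        weights.append(softmax(scores))
    return weights


# Toy usage: with all-zero queries the content term vanishes, so the
# weights are determined by the relative offsets alone.
queries = [[0.0, 0.0]] * 3
keys = [[1.0, 2.0]] * 3
rel_bias = [0.0, -1.0, -2.0]
weights = relative_attention_weights(queries, keys, rel_bias, max_distance=2)
```

Because the bias is indexed by the offset rather than by absolute time step, the same positional parameters apply wherever a motion segment occurs in a trajectory. This is one plausible reason such encodings could help when demonstrations are scarce, though the abstract itself does not state the mechanism.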