R3D: Revisiting 3D Policy Learning
For researchers in robot learning, this work provides a robust and scalable foundation for 3D imitation learning by addressing the training instabilities and overfitting that previously hindered progress.
The paper traces training instabilities and overfitting in 3D policy learning to two causes: the omission of 3D data augmentation and the adverse effects of Batch Normalization. It proposes a transformer-based 3D encoder paired with a diffusion decoder that significantly outperforms state-of-the-art 3D baselines on manipulation benchmarks.
3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
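To make the notion of 3D data augmentation concrete, the following is a minimal sketch of a generic point-cloud augmentation using only the Python standard library. The abstract does not specify which augmentations the authors apply, so the choice of transforms here (a random rotation about the z-axis, a uniform scale, and per-point Gaussian jitter), the function name `augment_point_cloud`, and all default parameter values are illustrative assumptions rather than the paper's method.

```python
import math
import random


def augment_point_cloud(points, max_angle_rad=math.pi / 12,
                        scale_range=(0.9, 1.1), jitter_std=0.005, rng=None):
    """Return a randomly rotated, scaled, and jittered copy of a point cloud.

    `points` is an iterable of (x, y, z) tuples. All defaults are
    hypothetical values chosen for illustration, not taken from the paper.
    """
    rng = rng or random.Random()

    # One rotation angle and one scale factor are sampled per cloud, so the
    # geometry of the scene is preserved up to a similarity transform.
    angle = rng.uniform(-max_angle_rad, max_angle_rad)
    scale = rng.uniform(*scale_range)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    augmented = []
    for x, y, z in points:
        # Rotate about the z-axis (assumed here to be the vertical axis).
        rx = cos_a * x - sin_a * y
        ry = sin_a * x + cos_a * y
        # Scale uniformly, then add independent Gaussian noise per coordinate.
        augmented.append((
            scale * rx + rng.gauss(0.0, jitter_std),
            scale * ry + rng.gauss(0.0, jitter_std),
            scale * z + rng.gauss(0.0, jitter_std),
        ))
    return augmented


if __name__ == "__main__":
    cloud = [(1.0, 0.0, 0.5), (0.0, 2.0, -1.0), (0.3, 0.4, 0.0)]
    for point in augment_point_cloud(cloud, rng=random.Random(0)):
        print(point)
```

In an imitation learning setting, any action targets expressed in the same coordinate frame as the point cloud would presumably need the same rotation and scale applied, so that observations and actions stay consistent. That step is omitted from this sketch.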