Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets
This work addresses viewpoint robustness in robot manipulation and offers researchers and practitioners a way to use simulation data effectively, though its contribution is an incremental improvement over existing sim2real translation techniques.
The paper addresses training vision-based robot manipulation policies that are robust to camera viewpoint variations. It proposes MANGO, an unpaired image translation method that translates simulated observations from diverse viewpoints into real-looking images, using only a small amount of fixed-camera real data for training. When the translated images are used for data augmentation in imitation learning, policies reach success rates as high as 60% on views where the non-augmented policy fails completely.
Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO, an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO are able to achieve success rates as high as 60% on views that the non-augmented policy fails completely on.
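For readers unfamiliar with the contrastive objective named above, the following is a minimal sketch of the standard InfoNCE loss for a single query feature. It is not MANGO's loss: the abstract does not specify how segmentation conditioning or the PatchNCE modification are implemented, so neither is shown here. The function name `info_nce`, the helper names, and the default temperature of 0.07 are assumptions chosen for illustration.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    """Scale a vector to unit L2 norm so that dot products are cosine similarities."""
    norm = math.sqrt(_dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.07):
    """Standard InfoNCE loss for one query (illustrative, not MANGO's variant).

    The loss is the cross-entropy of picking the positive out of the set
    {positive} + negatives, using temperature-scaled cosine similarity as logits.
    """
    q = _normalize(query)
    # Index 0 holds the positive logit; the rest are negatives.
    logits = [_dot(q, _normalize(positive)) / temperature]
    logits += [_dot(q, _normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # -log softmax(logits)[0]
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    # Hypothetical 2-D features: the query matches the positive, not the negative.
    aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
    # Swapped case: the query matches the negative instead.
    misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
    print(aligned < misaligned)
```

In image translation settings, the query and positive are typically features taken from corresponding locations in the input and translated images, with features from other locations serving as negatives. How MANGO selects or conditions these pairs on segmentation is not described in this text.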