Generative Image as Action Models
This work addresses robust and generalizable robot manipulation. It presents a new application of image-generation diffusion models to visuomotor control rather than an incremental improvement to an existing method.
The authors adapt image-generation diffusion models to visuomotor control by fine-tuning Stable Diffusion to draw joint-action targets on RGB images; a controller then maps these visual targets to a sequence of joint positions. On 25 RLBench and 9 real-world tasks, the resulting policies outperform state-of-the-art visuomotor approaches, particularly in robustness to scene perturbations and generalization to novel objects.
Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
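To make the idea of "drawing joint-actions" as image-space targets concrete, the following is a minimal, dependency-free sketch of how a target image might be constructed from future joint positions. It is an illustration under stated assumptions, not the authors' implementation: the pinhole camera model, the intrinsics `fx, fy, cx, cy`, the disc radius, the colors, and the function names `project_point` and `draw_targets` are all hypothetical and do not come from the source.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Image = List[List[Pixel]]      # indexed as image[row][col]


def project_point(point_cam: Sequence[float],
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[int, int]:
    """Pinhole projection of a 3D point (camera frame) to integer (u, v) pixels.

    Assumption: a calibrated pinhole camera; the source does not specify this.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return round(fx * x / z + cx), round(fy * y / z + cy)


def draw_targets(image: Image,
                 pixels: Sequence[Tuple[int, int]],
                 colors: Sequence[Pixel],
                 radius: int = 2) -> Image:
    """Return a copy of `image` with one filled disc per target pixel.

    Each joint gets its own color so a downstream controller could, in
    principle, tell the targets apart (hypothetical encoding).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input image untouched
    for (u, v), color in zip(pixels, colors):
        # Only visit the bounding box of the disc, clipped to the image.
        for row in range(max(0, v - radius), min(height, v + radius + 1)):
            for col in range(max(0, u - radius), min(width, u + radius + 1)):
                if (col - u) ** 2 + (row - v) ** 2 <= radius ** 2:
                    out[row][col] = color
    return out


# Usage: two hypothetical joint positions (meters, camera frame) drawn
# onto a blank 32x32 RGB image standing in for a camera observation.
blank: Image = [[(0, 0, 0)] * 32 for _ in range(32)]
joints = [(0.0, 0.0, 1.0), (0.5, -0.2, 1.0)]
targets = [project_point(p, fx=20.0, fy=20.0, cx=16.0, cy=16.0) for p in joints]
target_image = draw_targets(blank, targets, [(255, 0, 0), (0, 255, 0)])
```

In a behavior-cloning setup of the kind described above, image pairs like (observation, observation with targets drawn) would serve as fine-tuning data for the image generator, and a separate controller would be trained to map the drawn targets back to joint positions; both of those stages are outside this sketch.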