RO | May 28

VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

Victor Kowalski, Chengxi Li, Dongheui Lee

arXiv:2605.2956464.3

AI Analysis

For robotic manipulation, this work provides a practical method to combine the fast training of vision-based RL with the robustness of proprioceptive policies, eliminating the need for domain randomization or data augmentation.

The paper presents a human-in-the-loop RL framework, trained entirely in the real world, that distills a vision-enabled teacher policy into a vision-free student policy. It achieves 95% overall success on the NIST assembly benchmark board after approximately 50 minutes of training on 3 representative tasks, with robust generalization to 8 unseen task variants.

When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.
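The teacher-student idea in the abstract can be illustrated with a toy sketch. This is not the paper's method: the abstract does not specify the architecture or the loss, and the RL and human-in-the-loop components are left out entirely. Here a fixed linear "teacher" sees both proprioceptive and visual features, and a linear "student" is fit by ordinary least squares to imitate the teacher's actions from proprioception alone. All names, dimensions (19 = 7 pose + 6 twist + 6 wrench, 8 visual features, 6 action dimensions), and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def _with_bias(proprio):
    # Append a constant column so the student can learn an action offset.
    return np.hstack([proprio, np.ones((proprio.shape[0], 1))])


def distill_student(teacher_actions, proprio):
    """Fit a linear vision-free student to teacher actions by least squares."""
    W, *_ = np.linalg.lstsq(_with_bias(proprio), teacher_actions, rcond=None)
    return W


def student_act(W, proprio):
    """Student action computed from proprioception only (no vision input)."""
    return _with_bias(proprio) @ W


rng = np.random.default_rng(0)
n, proprio_dim, vision_dim, action_dim = 500, 19, 8, 6  # hypothetical sizes

# Synthetic proprioceptive observations (stand-in for pose, twist, wrench).
proprio = rng.normal(size=(n, proprio_dim))

# Assumption: visual features are partly predictable from proprioception.
mix = rng.normal(size=(proprio_dim, vision_dim))
vision = proprio @ mix + 0.1 * rng.normal(size=(n, vision_dim))

# Hypothetical teacher: a fixed linear policy over proprio + vision features.
teacher_W = rng.normal(size=(proprio_dim + vision_dim, action_dim))
teacher_actions = np.hstack([proprio, vision]) @ teacher_W

# Distillation step: regress teacher actions onto proprioception only.
W = distill_student(teacher_actions, proprio)
mse = float(np.mean((student_act(W, proprio) - teacher_actions) ** 2))
baseline = float(np.mean(teacher_actions ** 2))  # error of an all-zero policy
print(f"student imitation MSE: {mse:.4f} (zero-policy baseline: {baseline:.4f})")
```

In this toy setting the student can match the teacher only to the extent that the teacher's behavior is recoverable from proprioception; the residual error comes from the part of the visual features that proprioception does not explain.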
