Offline Learning of Counterfactual Predictions for Real-World Robotic Reinforcement Learning
This work addresses sample efficiency and exploration in real-world robotic reinforcement learning, particularly for contact-rich tasks, though its contribution is incremental: it combines offline learning with online policy training.
The paper tackles the challenge of training robotic manipulation policies that require both visuomotor and contact-rich skills using only terminal rewards. It learns counterfactual predictions from offline data to improve sample efficiency and guide exploration, and it reports experiments showing efficient reinforcement learning in both simulation and real-world settings.
We consider real-world reinforcement learning (RL) of robotic manipulation tasks that involve both visuomotor skills and contact-rich skills. We aim to train a policy that maps multimodal sensory observations (vision and force) to a manipulator's joint velocities under practical considerations. We propose to use offline samples to learn a set of general value functions (GVFs) that make counterfactual predictions from the visual inputs. We show that combining the offline-learned counterfactual predictions with force feedback in online policy learning allows efficient reinforcement learning given only a terminal (success/failure) reward. We argue that the learned counterfactual predictions form a compact and informative representation that enables sample efficiency and provides auxiliary reward signals that guide online exploration towards contact-rich states. We evaluate the approach in simulation and in real-world settings. Recordings of the real-world robot training can be found via https://sites.google.com/view/realrl.
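To make the idea of learning a GVF from offline samples more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a linear predictor over hypothetical visual features and uses off-policy semi-gradient TD(0) with importance-sampling ratios, which is one common way to train GVFs; the abstract does not state which update rule or network the authors use. All names (`Transition`, `fit_gvf`, `shaped_reward`, the `scale` parameter) are illustrative assumptions, and the additive form of the auxiliary reward is likewise assumed.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Transition:
    """One offline sample (all fields are illustrative assumptions)."""
    features: Sequence[float]       # hypothetical visual features phi(s)
    cumulant: float                 # signal whose discounted sum is predicted
    next_features: Sequence[float]  # phi(s'); all zeros at termination
    gamma: float                    # continuation factor, 0.0 when the episode ends
    rho: float = 1.0                # importance ratio: target policy / behavior policy


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_gvf(transitions: List[Transition], n_features: int,
            alpha: float = 0.1, epochs: int = 200) -> List[float]:
    """Fit a linear GVF with off-policy semi-gradient TD(0) on a fixed dataset."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for t in transitions:
            # TD error: cumulant plus discounted next prediction minus current prediction.
            td_error = (t.cumulant
                        + t.gamma * dot(w, t.next_features)
                        - dot(w, t.features))
            # The importance ratio reweights the update toward the target policy,
            # which is what makes the learned prediction counterfactual.
            for i, x in enumerate(t.features):
                w[i] += alpha * t.rho * td_error * x
    return w


def shaped_reward(terminal_reward: float, prediction: float,
                  scale: float = 0.1) -> float:
    """Assumed additive shaping: terminal reward plus a scaled GVF prediction."""
    return terminal_reward + scale * prediction


# Toy two-state chain with one-hot features: s0 -> s1 -> termination.
offline_data = [
    Transition(features=[1.0, 0.0], cumulant=0.0,
               next_features=[0.0, 1.0], gamma=0.9),
    Transition(features=[0.0, 1.0], cumulant=1.0,
               next_features=[0.0, 0.0], gamma=0.0),
]
weights = fit_gvf(offline_data, n_features=2)
```

On this toy chain the learned weights approach the true predictions, 0.9 for the first state and 1.0 for the second. In an online phase, a prediction such as `dot(weights, features)` could be appended to the policy's observation alongside force readings and passed to `shaped_reward`, which mirrors the roles the abstract assigns to the predictions (compact representation and auxiliary reward signal) under the assumptions stated above.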