General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks
This work addresses the challenge of poor generalization and slow convergence in goal-conditioned reinforcement learning (GCRL) for robotics applications, though the contribution is incremental, as it builds on existing mask and detection methods.
The authors tackle the problem of goal representation in visual goal-conditioned reinforcement learning by proposing an object-agnostic mask-based system. Trained with ground truth masks in simulation, it achieves 99.9% reaching accuracy on both training and unseen test objects, and it supports high-accuracy pick-up tasks without any positional information about the target.
Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.
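To illustrate how a mask can yield a dense reward without any 3D distance estimate, the following is a minimal sketch and not the authors' reward function. It assumes a binary target mask given as rows of 0/1 values, and it combines two hypothetical terms: the fraction of the image covered by the mask (which grows as the camera approaches the target) and how close the mask centroid is to the image center. The function name `mask_reward` and the weight `w_center` are illustrative assumptions.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask: rows of 0/1 pixel values


def mask_reward(mask: Mask, w_center: float = 0.5) -> float:
    """Hypothetical dense reward in [0, 1] computed from a binary target mask.

    This is an illustrative shaping scheme, not the formula used in the paper.
    It uses only image-space quantities, so no 3D distance is needed.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("mask must be a non-empty 2D grid")

    count = 0   # number of target pixels
    sum_r = 0   # sum of row indices of target pixels
    sum_c = 0   # sum of column indices of target pixels
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                count += 1
                sum_r += r
                sum_c += c

    if count == 0:
        return 0.0  # target not visible: no reward signal

    # Term 1: fraction of the image occupied by the target.
    coverage = count / (height * width)

    # Term 2: 1 when the centroid sits at the image center, smaller near edges.
    dy = abs(sum_r / count - (height - 1) / 2) / (height / 2)
    dx = abs(sum_c / count - (width - 1) / 2) / (width / 2)
    centeredness = 1.0 - max(dx, dy)

    return (1.0 - w_center) * coverage + w_center * centeredness


if __name__ == "__main__":
    centered = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    corner = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(mask_reward(centered), mask_reward(corner))
```

Because the reward depends only on the mask, the same function applies unchanged to any object for which a mask is available, whether that mask comes from simulator ground truth or from a detection model.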