Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning
This work addresses the scalability of preference-based RL, making it more practical for researchers and practitioners in reinforcement learning. The contribution is incremental, however: it builds on existing methods and combines them in a hybrid approach.
The paper tackles the high cost of human feedback in preference-based reinforcement learning by introducing PrefVLM, a framework that integrates Vision-Language Models with selective human feedback to reduce annotation requirements. Experiments on Meta-World tasks show that it achieves comparable or superior success rates to state-of-the-art methods while using up to 2x fewer human annotations.
Preference-based reinforcement learning (RL) offers a promising approach for aligning policies with human intent but is often constrained by the high cost of human feedback. In this work, we introduce PrefVLM, a framework that integrates Vision-Language Models (VLMs) with selective human feedback to significantly reduce annotation requirements while maintaining performance. Our method leverages VLMs to generate initial preference labels, which are then filtered to identify uncertain cases for targeted human annotation. Additionally, we adapt VLMs using a self-supervised inverse dynamics loss to improve alignment with evolving policies. Experiments on Meta-World manipulation tasks demonstrate that PrefVLM achieves comparable or superior success rates to state-of-the-art methods while using up to 2x fewer human annotations. Furthermore, we show that adapted VLMs enable efficient knowledge transfer across tasks, further minimizing feedback needs. Our results highlight the potential of combining VLMs with selective human supervision to make preference-based RL more scalable and practical.
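
The selective-labeling idea in the abstract (VLM labels first, human labels only for uncertain cases) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the uncertainty filter, so the confidence-margin rule below is an assumption, and the names `route_preferences`, `vlm_probs`, `ask_human`, and `margin` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class LabeledPair:
    index: int   # position of the segment pair in the query batch
    label: int   # 0 if the first segment is preferred, 1 if the second
    source: str  # "vlm" or "human"


def route_preferences(
    vlm_probs: Sequence[float],
    ask_human: Callable[[int], int],
    margin: float = 0.2,
) -> List[LabeledPair]:
    """Keep confident VLM labels and send uncertain pairs to a human.

    vlm_probs[i] is a hypothetical VLM estimate of the probability that
    the second segment of pair i is preferred. A pair counts as uncertain
    when that probability lies within `margin` of 0.5 (an assumed rule,
    not the filter used in the paper).
    """
    labeled = []
    for i, p in enumerate(vlm_probs):
        if abs(p - 0.5) < margin:
            # Uncertain case: request a targeted human annotation.
            labeled.append(LabeledPair(i, ask_human(i), "human"))
        else:
            # Confident case: accept the VLM's preference as the label.
            labeled.append(LabeledPair(i, int(p > 0.5), "vlm"))
    return labeled


if __name__ == "__main__":
    # Stand-in annotator that always prefers the second segment.
    result = route_preferences([0.95, 0.55, 0.1, 0.4], ask_human=lambda i: 1)
    for pair in result:
        print(pair)
```

In this sketch, a larger `margin` sends more pairs to the human annotator, so it trades annotation cost against the risk of keeping noisy VLM labels.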