Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference
This work addresses the need for more sample-efficient ways of aligning machine learning models with human judgments. It is an incremental advance that integrates two existing methods.
The paper tackles the cost and time required to collect human preference data by proposing a hybrid framework that combines the scalability of reinforcement learning from human feedback (RLHF) with the query efficiency of preferential Bayesian optimization (PBO). The authors report consistent improvements in sample efficiency and overall performance on high-dimensional preference optimization and LLM fine-tuning tasks.
Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet collecting such preference data is often costly and time-consuming, which motivates more efficient learning paradigms. Two established approaches offer complementary advantages: reinforcement learning from human feedback (RLHF) scales effectively to high-dimensional tasks such as large language model (LLM) fine-tuning, while preferential Bayesian optimization (PBO) achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
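To make the idea of acquisition-driven preference gathering concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a linear Bradley-Terry preference model, a posterior approximated by importance-weighted prior samples, and a simple variance-based acquisition rule. All names (`pref_prob`, `posterior_weights`, `select_query`) are hypothetical and chosen for illustration only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pref_prob(w, x_a, x_b):
    """Bradley-Terry probability that item a is preferred to item b
    under a linear utility w . x (an illustrative modelling assumption)."""
    z = dot(w, x_a) - dot(w, x_b)
    return 1.0 / (1.0 + math.exp(-z))

def posterior_weights(particles, data):
    """Self-normalised importance weights for prior samples `particles`,
    given observed comparisons data = [(x_a, x_b, a_wins), ...]."""
    log_w = []
    for w in particles:
        ll = 0.0
        for x_a, x_b, a_wins in data:
            p = pref_prob(w, x_a, x_b)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            ll += math.log(p if a_wins else 1.0 - p)
        log_w.append(ll)
    m = max(log_w)  # subtract the max for numerical stability
    unnorm = [math.exp(l - m) for l in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def select_query(particles, weights, candidates):
    """Return the index of the candidate pair whose predicted outcome has
    the highest posterior variance (a simple uncertainty-based acquisition)."""
    best, best_var = 0, -1.0
    for i, (x_a, x_b) in enumerate(candidates):
        probs = [pref_prob(w, x_a, x_b) for w in particles]
        mean = sum(q * p for q, p in zip(weights, probs))
        var = sum(q * (p - mean) ** 2 for q, p in zip(weights, probs))
        if var > best_var:
            best, best_var = i, var
    return best
```

In a loop of this kind, the selected pair would be shown to an annotator, the answer appended to `data`, and the weights recomputed before the next query. How the resulting preference model feeds the RLHF policy update is specific to the paper and is not modelled here.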