CL · AI · LG · Jun 17, 2024

WPO: Enhancing RLHF with Weighted Preference Optimization

arXiv:2406.11827v2 · 46 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in efficiently aligning large language models with human values: the distributional gap in off-policy preference data. It is an incremental improvement over existing methods.

The paper tackles the distributional gap problem in off-policy preference optimization for RLHF by proposing Weighted Preference Optimization (WPO), which reweights preference pairs by their probability under the current policy to simulate on-policy learning. WPO improves on DPO by up to 5.6% on Alpaca Eval 2 and reaches a 76.7% length-controlled winning rate against GPT-4-turbo with a Gemma-2-9b-it base model.

Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 76.7% based on Gemma-2-9b-it. We release the code and models at https://github.com/wzhouad/WPO.
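The reweighting idea in the abstract can be illustrated with a minimal sketch. It assumes the weight of a pair is the product of the length-normalized probabilities of both responses under the current policy, applied to a standard DPO loss; the exact weighting used in the paper may differ, and all function names and the data layout here are hypothetical rather than taken from the released code.

```python
import math

def sequence_logp_per_token(token_logps):
    """Length-normalized log-probability of a response under the current policy."""
    return sum(token_logps) / len(token_logps)

def pair_weight(chosen_token_logps, rejected_token_logps):
    # Assumed weighting: product of the two length-normalized probabilities.
    # Log-probs are <= 0, so the weight lies in (0, 1]. In an autograd
    # framework this value would be treated as a constant (no gradient).
    return math.exp(
        sequence_logp_per_token(chosen_token_logps)
        + sequence_logp_per_token(rejected_token_logps)
    )

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, from summed sequence log-probs."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) computed in a numerically stable way
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def weighted_preference_loss(pairs, beta=0.1):
    """Mean DPO loss where each pair is scaled by its policy-probability weight."""
    total = 0.0
    for p in pairs:
        w = pair_weight(p["chosen_token_logps"], p["rejected_token_logps"])
        loss = dpo_loss(
            sum(p["chosen_token_logps"]),
            sum(p["rejected_token_logps"]),
            p["ref_chosen"],
            p["ref_rejected"],
            beta,
        )
        total += w * loss
    return total / len(pairs)
```

Under this assumption, pairs whose responses the current policy would itself be likely to generate keep a weight close to 1, while pairs that are unlikely under the policy are down-weighted, which is how off-policy data is made to resemble on-policy data.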

Code Implementations: 1 repo