Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
This paper addresses a robustness issue in LLM alignment methods that is relevant to researchers and practitioners, though the contribution appears incremental: it enhances existing direct alignment approaches rather than replacing them.
The paper tackles the problem of degenerate policies that arise in direct preference optimization methods when preference labels are noisy or non-deterministic. It introduces DRDO, which simultaneously models rewards and preferences. Results show that DRDO-trained policies surpass DPO and e-DPO in expected rewards and robustness, with improvements demonstrated on the Ultrafeedback and TL;DR datasets.
Traditional RLHF-based LLM alignment methods explicitly maximize the expected rewards from a separate reward model. More recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular due to their simplicity, DPO and similar direct alignment methods, which rely heavily on the Bradley-Terry-based pairwise preference formulation, can still lead to degenerate policies when challenged by non-deterministic or noisy preference labels, for example human scoring of two candidate outputs with low confidence. This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences with a novel preference likelihood formulation. Results on the Ultrafeedback and TL;DR datasets demonstrate that DRDO-trained policies surpass methods such as DPO and e-DPO in terms of expected rewards and are more robust, on average, to noisy preference signals as well as out-of-distribution (OOD) settings.
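To make the Bradley-Terry formulation mentioned above more concrete, the following is a minimal Python sketch of the standard per-pair DPO loss. It also includes a generic squared-error term for matching an oracle's reward margin. The function names, argument names, and the distillation term are illustrative assumptions. This is not the DRDO objective, which the abstract does not spell out.

```python
import math


def bradley_terry_prob(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) under Bradley-Terry: sigmoid(r_w - r_l)."""
    d = r_w - r_l
    # Numerically stable sigmoid for both signs of d.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (winner, loser) pair.

    The implicit reward of a response is beta * (log pi - log pi_ref);
    the loss is -log sigmoid(implicit reward margin).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == softplus(-m), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def reward_distill_loss(pred_r_w: float, pred_r_l: float,
                        oracle_r_w: float, oracle_r_l: float) -> float:
    """Hypothetical distillation term (an assumption, not the paper's loss):
    squared error between the oracle's reward margin and the model's."""
    return ((oracle_r_w - oracle_r_l) - (pred_r_w - pred_r_l)) ** 2


if __name__ == "__main__":
    # With a hard label, the DPO loss keeps shrinking as the margin grows,
    # so the optimizer is never told to stop widening the gap.
    for m in (0.0, 1.0, 5.0, 10.0):
        print(m, dpo_pair_loss(m, 0.0, 0.0, 0.0, beta=1.0))
```

At zero margin the DPO loss equals log 2. It then decreases toward zero without ever reaching it, which is one common explanation for why purely preference-based objectives can push policies toward degenerate extremes when labels are treated as certain. A regression-style term like the hypothetical one above is bounded below at a finite target margin, which suggests why anchoring to oracle rewards may help. How DRDO actually combines the two signals should be taken from the paper itself.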