LG · AI · ML · Oct 26, 2024

Uncertainty-Penalized Direct Preference Optimization

arXiv:2410.20187v1 · 4 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses proxy reward overoptimization in LLM alignment when human preferences are varied or ambiguous. It is an incremental improvement over existing DPO methods.

The paper tackles proxy reward overoptimization in aligning LLMs to human preferences. It develops a pessimistic framework for Direct Preference Optimization that penalizes preference uncertainty. Compared to vanilla DPO, the method shows improved overall performance and better completions on high-uncertainty prompts.

Aligning Large Language Models (LLMs) to human preferences in content, style, and presentation is challenging, in part because preferences are varied, context-dependent, and sometimes inherently ambiguous. While successful, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are prone to the issue of proxy reward overoptimization. Analysis of the DPO loss reveals a critical need for regularization for mislabeled or ambiguous preference pairs to avoid reward hacking. In this work, we develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes, inspired by offline reinforcement learning. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. Evaluation of the methods is performed with GPT2 Medium on the Anthropic-HH dataset using a model ensemble to obtain uncertainty estimates, and shows improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
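The abstract does not spell out the penalization schemes, so the following is only a minimal sketch of the general idea, under stated assumptions. It uses the standard DPO margin, takes the standard deviation of that margin across a model ensemble as the uncertainty estimate, and adds the scaled uncertainty to the margin inside the log-sigmoid. The additive form, the weight `lam`, and all function names are hypothetical illustrations and are not taken from the paper. Under this form the gradient weight sigmoid(-(margin + lam * u)) shrinks as the uncertainty u grows, which matches the stated goal of attenuating the loss gradient for uncertain samples.

```python
import math
from statistics import pstdev


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO margin: beta times the difference of policy/reference log-ratios."""
    return beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))


def ensemble_uncertainty(ensemble_margins):
    """Assumed uncertainty estimate: std. dev. of the margin across ensemble members."""
    return pstdev(ensemble_margins)


def penalized_dpo_loss(margin, uncertainty, lam=1.0):
    """Hypothetical additive penalization: -log sigmoid(margin + lam * uncertainty).

    With uncertainty = 0 this reduces to the vanilla DPO loss.
    """
    return -log_sigmoid(margin + lam * uncertainty)


def gradient_weight(margin, uncertainty, lam=1.0):
    """Magnitude of d(loss)/d(margin) for the penalized loss above."""
    return sigmoid(-(margin + lam * uncertainty))


if __name__ == "__main__":
    m = dpo_margin(-1.0, -2.0, -1.5, -1.5, beta=0.1)
    u = ensemble_uncertainty([0.05, 0.40, -0.30])  # made-up ensemble margins
    print("vanilla loss:  ", penalized_dpo_loss(m, 0.0))
    print("penalized loss:", penalized_dpo_loss(m, u))
    print("gradient weight, vanilla vs penalized:",
          gradient_weight(m, 0.0), gradient_weight(m, u))
```

In this sketch a pair on which the ensemble disagrees strongly contributes a smaller gradient than a confidently labeled pair with the same margin, so mislabeled or ambiguous pairs pull the policy less.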
