LG | ML | Jun 2, 2025

Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

arXiv:2506.01523v1 | 3 citations | h-index: 5
Originality: Highly original
AI Analysis

The paper addresses the degenerate solutions that alignment methods such as RLHF and DPO can produce, and offers a more principled approach to improving LLM output quality.

The paper tackles the problem of alignment in large language models by reframing it as distribution learning from pairwise preferences, proposing three objectives that theoretically converge to the target model and avoid degeneracy. Empirically, the framework, particularly preference distillation, matches or outperforms RLHF and DPO across tasks and models.

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as "loss + regularization," the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Preference Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as distribution learning from pairwise preference feedback, explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently matches or outperforms RLHF and DPO across various tasks and models.
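
As a rough illustration of what "distribution learning from pairwise preferences" can look like, here is a minimal toy sketch in Python. It is not the paper's method: it assumes, purely for illustration, that preferences follow a Bradley-Terry model whose reward is the log-probability of a response under the target distribution, so that P(a preferred over b) = target(a) / (target(a) + target(b)). The names `sample_preferences` and `fit_preference_mle`, the four-outcome "vocabulary", and the step size are all hypothetical choices made for this example.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized logits into a probability vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_preferences(target, n, rng):
    """Draw n (winner, loser) pairs over indices of `target`.

    Assumption (ours, for illustration): Bradley-Terry preferences with
    reward log target(y), i.e. P(a beats b) = target(a) / (target(a) + target(b)).
    """
    k = len(target)
    data = []
    for _ in range(n):
        a, b = rng.sample(range(k), 2)  # two distinct candidate responses
        p_a_wins = target[a] / (target[a] + target[b])
        data.append((a, b) if rng.random() < p_a_wins else (b, a))
    return data


def fit_preference_mle(data, k, steps=2000, lr=1.0):
    """Maximize the mean of log sigmoid(z_w - z_l) by plain gradient ascent.

    Returns softmax(z), the fitted categorical distribution.
    """
    # Aggregate pair counts so each gradient step costs O(k^2), not O(n).
    wins = [[0] * k for _ in range(k)]
    for w, l in data:
        wins[w][l] += 1
    n = len(data)
    logits = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for w in range(k):
            for l in range(k):
                if wins[w][l]:
                    # d/dz_w log sigmoid(z_w - z_l) = sigmoid(z_l - z_w)
                    g = wins[w][l] / (1.0 + math.exp(logits[w] - logits[l]))
                    grad[w] += g
                    grad[l] -= g
        logits = [z + lr * g / n for z, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, 0.25, 0.15, 0.1]  # toy "target language model"
    prefs = sample_preferences(target, 20000, rng)
    print(fit_preference_mle(prefs, len(target)))
```

Under this assumed preference model, the fitted distribution stays spread across all four outcomes and should approach `target` as the number of pairs grows, rather than collapsing onto the single most-preferred response. That is the kind of non-degenerate behavior the abstract contrasts with deterministic solutions.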

