LGNov 30, 2025

What Is Preference Optimization Doing, How and Why?

arXiv:2512.00778v14 citationsh-index: 12
Originality Incremental advance
AI Analysis

This work provides deeper insights into preference optimization mechanisms, which is crucial for improving alignment in large language models, though it is incremental as it builds on existing methods.

The paper analyzes the optimization dynamics of preference optimization methods like DPO and PPO in large language models, revealing that DPO follows stable targets while PPO uses dynamic targets for exploration-exploitation balance, and it examines the distinct roles of components such as positive learning, negative learning, and loss reweighting.

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO follows dynamic targets that balance exploration and exploitation, thus validating the common belief from a new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key components in PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the learning targets meanwhile mutually offset each other. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to absolute values of token-level advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes