LG · Oct 23, 2025

Why DPO is a Misspecified Estimator and How to Fix It

arXiv:2510.20413v1 · 5 citations · h-index: 13
Originality: Highly original
AI Analysis

This paper addresses a flaw in DPO, a method for aligning AI models with human preferences, and proposes a fix intended to improve robustness and performance in practice.

The paper identifies that Direct Preference Optimization (DPO) is a misspecified estimator when the true reward function cannot be realized by the policy class, which leads to failure modes such as preference order reversal and sensitivity to the preference data distribution. It proposes AuxDPO, which adds auxiliary variables to the DPO loss to better approximate the RLHF solution, and reports superior performance in bandit settings and on LLM alignment tasks.
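
For readers unfamiliar with the loss being discussed, the following is a minimal sketch of the standard DPO objective as it is usually written, not code from the paper. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions; the inputs are log-probabilities of the preferred (`w`) and rejected (`l`) responses under the trained policy and a frozen reference policy.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss, averaged over a batch of preference pairs.

    The implicit reward of a response is beta * (log pi - log pi_ref).
    The loss is -log sigmoid(implicit reward of winner - implicit reward of loser).
    Argument names and the default beta are illustrative assumptions.
    """
    logp_w, logp_l = np.asarray(logp_w), np.asarray(logp_l)
    ref_logp_w, ref_logp_l = np.asarray(ref_logp_w), np.asarray(ref_logp_l)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -margin)))

# When the policy equals the reference, every margin is zero and the loss is log(2).
same = np.array([-1.2, -0.7])
loss_at_reference = dpo_loss(same, same, same, same)
```

The sketch makes the estimation view concrete: the only quantity the loss sees is the implicit reward gap, so the set of reward functions DPO can fit is exactly the set the policy class can express.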

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
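
The sensitivity to the preference data distribution described in the abstract can be illustrated with a toy bandit. This is a hypothetical construction for intuition only, not one of the paper's experiments: the three arms, the features `phi`, the rewards `r_true`, and the helper `fit_dpo_theta` are all assumptions. The policy class has a single parameter `theta`, preferences follow a Bradley-Terry model, and the true rewards are deliberately not expressible by the class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 3-armed bandit (illustrative values, not taken from the paper).
phi = np.array([0.0, 1.0, 2.0])      # one scalar feature per arm
r_true = np.array([0.0, 1.0, 0.5])   # true rewards; not affine in phi, so not realizable
beta = 1.0

def fit_dpo_theta(pairs, weights, steps=5000, lr=0.5):
    """Minimise the population DPO loss over theta by gradient descent.

    Policy class: pi_theta(a) proportional to pi_ref(a) * exp(theta * phi[a]),
    so the implicit reward gap between arms a and b is
    beta * theta * (phi[a] - phi[b]). `weights` is the sampling
    distribution over the compared pairs.
    """
    theta = 0.0
    for _ in range(steps):
        grad = 0.0
        for (a, b), w in zip(pairs, weights):
            d = phi[a] - phi[b]
            p = sigmoid(r_true[a] - r_true[b])  # Bradley-Terry P(a preferred over b)
            grad += w * beta * d * (sigmoid(beta * theta * d) - p)
        theta -= lr * grad
    return theta

pairs = [(0, 1), (1, 2)]
theta_a = fit_dpo_theta(pairs, weights=[1.0, 0.0])    # only arms 0 vs 1 compared
theta_b = fit_dpo_theta(pairs, weights=[0.0, 1.0])    # only arms 1 vs 2 compared
theta_mix = fit_dpo_theta(pairs, weights=[0.5, 0.5])  # both pairs equally often
print(theta_a, theta_b, theta_mix)
```

In this toy, the fitted `theta` changes sign depending on which pairs are sampled. Data on arms 0 and 1 alone gives a positive `theta`, so the policy ranks arm 2 above arm 1 even though arm 1 has the higher true reward. Data on arms 1 and 2 alone gives a negative `theta`, so the policy favours arm 0, which has the lowest true reward. If `r_true` were affine in `phi`, every sampling distribution would recover the same `theta`.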
