CL · May 29, 2025

Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

arXiv:2505.23316v1 · 5 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses a fundamental limitation in contrastive alignment methods for LLMs, offering a solution to improve model reliability in preference-based tasks, though it is incremental as it builds on DPO.

The paper addresses likelihood underdetermination in direct preference optimization (DPO) for aligning large language models, a flaw that produces reward-hacking effects. It introduces PRO, a method that resolves the issue and outperforms existing methods in experiments with pairwise, binary, and scalar feedback.

Direct alignment methods typically optimize large language models (LLMs) by contrasting the likelihoods of preferred versus dispreferred responses. While effective in steering LLMs to match relative preference, these methods are frequently noted for decreasing the absolute likelihoods of example responses. As a result, aligned models tend to generate outputs that deviate from the expected patterns, exhibiting a reward-hacking effect even without a reward model. This undesired consequence exposes a fundamental limitation in contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO), the seminal direct alignment method, and demonstrate that its loss theoretically admits a decomposed reformulation. The reformulated loss not only broadens applicability to a wider range of feedback types, but also provides novel insights into the underlying cause of likelihood underdetermination. Specifically, the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and reinstating its complete version effectively resolves the underdetermination issue. Leveraging these findings, we introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types, eliminating likelihood underdetermination through an efficient approximation of the complete regularizer. Comprehensive experiments show the superiority of PRO over existing methods in scenarios involving pairwise, binary and scalar feedback.
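
The likelihood underdetermination described in the abstract can be illustrated with a minimal sketch of the standard per-pair DPO loss. This is not the paper's PRO method or its decomposed loss. The function name `dpo_loss`, the `beta` value, and the log-probabilities are hypothetical values chosen for illustration. Because the loss depends only on the difference between the two policy-to-reference log-ratios, lowering the likelihood of both responses by the same amount leaves the loss unchanged.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a single (preferred, dispreferred) pair.

    logp_w / logp_l: policy log-likelihoods of the preferred / dispreferred response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-likelihoods for one preference pair.
base = dpo_loss(-10.0, -12.0, -11.0, -11.0)

# Both policy log-likelihoods drop by 50 nats; the contrast between them is unchanged.
shifted = dpo_loss(-60.0, -62.0, -11.0, -11.0)

print(base, shifted, math.isclose(base, shifted))
```

The two losses coincide even though the second policy assigns far lower absolute likelihood to both responses, which is the sense in which a purely contrastive objective leaves absolute likelihoods underdetermined.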

