LG · Aug 20, 2025

Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization

arXiv:2508.14947v2 · 1 citation · h-index: 5
Originality: Incremental advance
AI Analysis

The paper addresses stability and tunability problems in preference alignment for AI models, though it appears incremental because it builds directly on DPO.

The paper tackles the overfitting and collapse issues in Direct Preference Optimization (DPO) by proposing Linear Preference Optimization (LPO), which the authors report improves performance on general text, math, and text-to-speech tasks through gradient decoupling and regularization.

DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. However, DPO is prone to overfitting and collapse. To address these challenges, we propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Through extensive experiments, we demonstrate that LPO consistently improves performance on various tasks, including general text tasks, math tasks, and text-to-speech (TTS) tasks. These results establish LPO as a robust and tunable paradigm for preference alignment, and we release the source code, models, and training data publicly.
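
To make the first point concrete, the sketch below contrasts the standard DPO log-sigmoid loss with an absolute-difference loss on the same log-ratio margin. The DPO part follows the well-known formula; the absolute-difference part is only an illustrative assumption, since the abstract does not give LPO's exact objective. The names `abs_loss`, `abs_grad_wrt_margin`, and the `offset` parameter are hypothetical, and the positive regularization term and rejection coefficient are not modelled here.

```python
import math


def log_ratio_margin(logp_c, logp_r, ref_c, ref_r):
    """Policy-vs-reference log-ratio of the chosen response minus that of the rejected one."""
    return (logp_c - ref_c) - (logp_r - ref_r)


def dpo_loss(margin, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)), written in a numerically stable form."""
    x = beta * margin
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_grad_wrt_margin(margin, beta=0.1):
    """d/dm of -log(sigmoid(beta * m)) = -beta * sigmoid(-beta * m); it shrinks as m grows."""
    return -beta / (1.0 + math.exp(beta * margin))


def abs_loss(margin, beta=0.1, offset=1.0):
    """Hypothetical absolute-difference loss (an assumption, not the paper's exact objective)."""
    return abs(beta * margin - offset)


def abs_grad_wrt_margin(margin, beta=0.1, offset=1.0):
    """Gradient of the hypothetical loss: constant magnitude beta, sign set by the offset."""
    d = beta * margin - offset
    return beta * ((d > 0) - (d < 0))


if __name__ == "__main__":
    for m in (0.0, 5.0, 50.0):
        print(m, dpo_grad_wrt_margin(m), abs_grad_wrt_margin(m))
```

Under these assumptions, the DPO gradient magnitude depends on the current margin through the sigmoid, whereas the absolute-difference gradient has constant magnitude and only changes sign once the scaled margin crosses the offset. This is one plausible reading of "isolating the optimization dynamics"; the released source code is the authority on the actual formulation.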
