LG | Feb 28, 2025

Gradient Imbalance in Direct Preference Optimization

arXiv:2502.20847v1 | 4 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This work targets a specific bottleneck in DPO-based alignment, namely gradient imbalance during training, and represents an incremental advance.

The paper tackles the performance gap between Direct Preference Optimization (DPO) and RLHF pipelines by identifying gradient imbalance as a critical limitation. It proposes Balanced-DPO, a modified objective with gradient reweighting that the authors report effectively addresses the issue.

Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO) based Reinforcement Learning from Human Feedback (RLHF). However, empirical evaluations consistently reveal suboptimal performance in DPO compared to common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments demonstrate the effectiveness of Balanced-DPO and validate the theoretical findings. They confirm that addressing gradient imbalance is key to improving DPO's performance and highlight a promising direction for future research.
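
To make the abstract's terms concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, with its analytic gradients for the chosen and rejected responses. It is not the paper's Balanced-DPO objective, whose reweighting rule is not given in this summary. The function name `dpo_loss_and_grads`, its arguments, and the choice to compare gradients in probability space are all assumptions made for illustration. Whether this particular asymmetry is the "gradient imbalance" the authors analyze is likewise an assumption.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss_and_grads(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair, plus gradients.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    All names here are hypothetical; this is an illustration, not the paper's code.
    """
    # Implicit reward margin between chosen and rejected responses.
    u = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(u)), written as softplus(-u).
    loss = math.log1p(math.exp(-u))

    # dL/du = -sigmoid(-u); the margin enters symmetrically in log-prob space.
    scale = beta * sigmoid(-u)
    grad_logp_w = -scale  # gradient descent raises the chosen log-prob
    grad_logp_l = scale   # gradient descent lowers the rejected log-prob

    # Chain rule d(log p)/dp = 1/p gives the gradients in probability space,
    # where the two magnitudes differ by the factor p_w / p_l.
    grad_p_w = grad_logp_w / math.exp(logp_w)
    grad_p_l = grad_logp_l / math.exp(logp_l)
    return loss, grad_logp_w, grad_logp_l, grad_p_w, grad_p_l


if __name__ == "__main__":
    # Policy equals reference, chosen response at p=0.2, rejected at p=0.05.
    out = dpo_loss_and_grads(math.log(0.2), math.log(0.05),
                             math.log(0.2), math.log(0.05), beta=0.1)
    loss, g_lw, g_ll, g_pw, g_pl = out
    print(f"loss={loss:.4f}")
    print(f"log-space grads: chosen={g_lw:.4f}, rejected={g_ll:.4f}")
    print(f"prob-space grads: chosen={g_pw:.4f}, rejected={g_pl:.4f}")
```

In this toy setting the two log-probability gradients have equal magnitude, while in probability space the rejected response's gradient is four times larger than the chosen one's (the ratio 0.2 / 0.05). A reweighting scheme in the spirit of the abstract would rescale the two terms, but the specific weights used by Balanced-DPO are not stated here.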

