LG · Feb 5, 2025

Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms

Xuerui Su, Yue Wang, Jinhua Zhu, Mingyang Yi, Feng Xu, Zhiming Ma, Yuting Liu

arXiv:2502.03095v1 · 16.96 citations · h-index: 10

Originality: Synthesis-oriented

AI Analysis

This work clarifies theoretical ambiguities for researchers in LLM alignment, but it is incremental as it builds on existing RLHF methods without introducing new algorithms.

The paper addresses the confusion over whether Direct Preference Optimization (DPO) should be classified as a Reinforcement Learning algorithm. It analyzes DPO's loss function, target distribution, and key components, and establishes a unified framework that connects DPO with RL and with other RLHF algorithms such as PPO.

With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignment-based Direct Preference Optimization (DPO). The mismatch between DPO and PPO, such as DPO's use of a classification loss driven by human-preferred data, has raised confusion about whether DPO should be classified as a Reinforcement Learning (RL) algorithm. To address these ambiguities, we focus on three key aspects related to DPO, RL, and other RLHF algorithms: (1) the construction of the loss function; (2) the target distribution to which the algorithm converges; (3) the impact of key components within the loss function. Specifically, we first establish a unified framework named UDRRA connecting these algorithms based on the construction of their loss functions. Next, we uncover their target policy distributions within this framework. Finally, we investigate the critical components of DPO to understand their impact on the convergence rate. Our work provides a deeper understanding of the relationship between DPO, RL, and other RLHF algorithms, offering new insights for improving existing algorithms.
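
For readers unfamiliar with the "classification loss" the abstract refers to, here is a minimal sketch of the standard per-pair DPO loss as it is commonly written: the negative log-sigmoid of a scaled difference in log-probability ratios between the preferred and rejected responses. This is not the paper's UDRRA framework. The function name, argument names, and the default `beta` value are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities under the trained policy and a frozen reference policy.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; names and default beta are assumptions).

    Each argument is the log-probability of a full response under either the
    trained policy (logp_*) or the frozen reference policy (ref_logp_*).
    """
    # Implicit reward margin: beta times the difference of log-ratios.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2, the same value a binary classifier gets for a 50/50 guess. Raising the preferred response's log-probability relative to the rejected one lowers the loss, which is why the objective resembles binary classification on preference pairs.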
