LG · CL · Sep 29, 2024

The Crucial Role of Samplers in Online Direct Preference Optimization

Tsinghua
arXiv:2409.19605v3 · 24 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses a theoretical gap in DPO for researchers and practitioners in AI alignment, though it is incremental as it builds on existing DPO methods.

The paper analyzes how the choice of sampler affects the convergence rate of Direct Preference Optimization (DPO) for language model alignment. It shows that uniform sampling achieves linear convergence, while the authors' proposed online sampler achieves quadratic convergence. A practical version of the sampler outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset.
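To make the gap between the two rates concrete, here is a minimal numerical sketch. It assumes the usual optimization-textbook meaning of the terms (linear: the error shrinks by a constant factor each step; quadratic: the error is roughly squared each step). The function names, starting error, and constants are illustrative assumptions, not values from the paper.

```python
def linear_errors(e0, rate, steps):
    """Errors under e_{k+1} = rate * e_k, with 0 < rate < 1 (linear convergence)."""
    errs = [e0]
    for _ in range(steps):
        errs.append(rate * errs[-1])
    return errs


def quadratic_errors(e0, const, steps):
    """Errors under e_{k+1} = const * e_k**2 (quadratic convergence)."""
    errs = [e0]
    for _ in range(steps):
        errs.append(const * errs[-1] ** 2)
    return errs


# Illustrative values only: same starting error, five steps each.
lin = linear_errors(0.5, 0.5, 5)
quad = quadratic_errors(0.5, 1.0, 5)

for k, (a, b) in enumerate(zip(lin, quad)):
    print(f"step {k}: linear={a:.3e}  quadratic={b:.3e}")
```

Under these assumed constants the linear sequence halves at each step, while the quadratic sequence doubles its number of correct digits, which is why the separation is substantial once the iterate is close to the optimum.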

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, its optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.
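For readers unfamiliar with the objective being analyzed, the following is a minimal sketch of the standard DPO loss for a single preference pair, as commonly stated in the DPO literature. It does not implement the paper's sampler, posterior distributions, or logit mixing. The function name, argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. beta scales the implicit reward (assumed value).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a margin of 0 and a loss of log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Shifting probability mass toward the chosen response lowers the loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

Which response pairs this loss is evaluated on is decided by the sampler, which is the component whose effect on convergence the paper studies.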

Code Implementations: 1 repo