CL · AI · LG · Jun 17, 2024

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

arXiv:2406.11817v1 · 31 citations
Originality: Incremental advance
AI Analysis

This work addresses a specific pitfall in aligning language models with human feedback: responses grow more verbose as iterative DPO improves their quality. It offers an incremental improvement for researchers and practitioners in natural language processing.

The authors address the increase in verbosity under iterative Direct Preference Optimization (DPO) by introducing iterative length-regularized DPO (iLR-DPO), which penalizes response length. Their 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
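
The abstract says iLR-DPO penalizes response length but does not spell out the objective. The following is a minimal sketch of one way such a penalty can enter a DPO loss, assuming a margin-style regularizer in which a length-difference term is added inside the sigmoid. The function names, the `beta` and `alpha` defaults, and the exact form of the margin are illustrative assumptions, not taken from this summary.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def lr_dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    len_chosen: int,
    len_rejected: int,
    beta: float = 0.1,    # hypothetical default, not from the source
    alpha: float = 0.01,  # hypothetical default, not from the source
) -> float:
    """Per-pair DPO loss with an assumed length margin.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model, plus response lengths (e.g. in tokens).
    """
    # Implicit reward gap between chosen and rejected, as in vanilla DPO.
    reward_gap = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Assumed length margin: if the chosen response is longer, part of the
    # preference is attributed to length, so the policy is pushed less
    # strongly toward it. alpha = 0 recovers vanilla DPO.
    length_margin = alpha * (len_chosen - len_rejected)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-(reward_gap + length_margin))


if __name__ == "__main__":
    # Same log-probabilities, chosen response twice as long as the rejected one.
    vanilla = lr_dpo_loss(-40.0, -50.0, -45.0, -45.0, 120, 60, alpha=0.0)
    regularized = lr_dpo_loss(-40.0, -50.0, -45.0, -45.0, 120, 60, alpha=0.01)
    print(f"vanilla={vanilla:.4f} regularized={regularized:.4f}")
```

Under this assumed form, a pair whose chosen response is longer yields a smaller loss, and therefore a weaker gradient toward the longer response, than vanilla DPO. A pair whose chosen response is shorter yields a larger loss and a stronger gradient. In an iterative setup, a loss of this kind would be applied in each round to preference pairs freshly labeled by the reward model.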

