CL, AI | Mar 12, 2024

ORPO: Monolithic Preference Optimization without Reference Model

arXiv:2403.07691v243.1619 citations | h-index: 8 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work simplifies preference alignment for language model developers by removing the reference model requirement. The advance is incremental, since it builds on existing SFT methods.

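The paper tackles preference alignment in language models by introducing ORPO, a monolithic odds ratio preference optimization algorithm that folds preference alignment into supervised fine-tuning, removing both the reference model and the separate alignment phase. Models fine-tuned with ORPO reach up to 12.20% on AlpacaEval 2.0, 66.19% on IFEval (instruction-level loose), and 7.32 on MT-Bench, which the authors report as surpassing state-of-the-art language models with more than 7B and 13B parameters.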

While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across the diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-$α$ (7B) and Mistral-ORPO-$β$ (7B).
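The abstract describes ORPO as ordinary SFT plus a minor odds-ratio penalty on the disfavored response. A minimal sketch of that idea for a single preference pair follows, assuming the commonly cited form of the objective: the SFT negative log-likelihood on the favored response plus a weight times -log sigmoid of the log odds ratio, with each sequence probability taken as the exponentiated average token log-probability. The function names, the plain-Python representation of token log-probabilities, and the weight `lam=0.1` are illustrative assumptions, not the authors' released code.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) for p = exp(avg_logp).

    Assumes avg_logp is strictly negative, so that 0 < p < 1.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)) = log(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    """Sketch of an ORPO-style loss for one (favored, disfavored) pair.

    Inputs are per-token log-probabilities the model assigns to each
    response. `lam` is a hypothetical weighting value for illustration.
    Returns (total_loss, sft_term, odds_ratio_penalty).
    """
    # Length-normalized log-likelihood of each response.
    avg_chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_rejected = sum(rejected_token_logps) / len(rejected_token_logps)

    # Standard SFT term: negative log-likelihood of the favored response.
    sft_term = -avg_chosen

    # Log odds ratio between the favored and disfavored responses.
    log_odds_ratio = log_odds(avg_chosen) - log_odds(avg_rejected)

    # The penalty shrinks as the favored response becomes relatively more likely.
    penalty = neg_log_sigmoid(log_odds_ratio)

    return sft_term + lam * penalty, sft_term, penalty


if __name__ == "__main__":
    total, sft, penalty = orpo_loss([-0.2, -0.4], [-1.0, -2.0])
    print(f"total={total:.4f} sft={sft:.4f} penalty={penalty:.4f}")
```

Only the policy model's own log-probabilities appear in this computation, which is what makes the approach reference model-free. In a real training loop the same arithmetic would be applied to batched tensors with automatic differentiation.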
