LG AI, Feb 18, 2025

KL Penalty Control via Perturbation for Direct Preference Optimization

Sangkyu Lee, Janghoon Han, Hosung Song, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

arXiv:2502.13177v3

Originality: Incremental advance

AI Analysis

The paper addresses the inefficient KL trade-off in DPO, a problem relevant to researchers and practitioners in language model alignment. The advance appears incremental because it builds directly on DPO.

The paper targets a limitation of Direct Preference Optimization (DPO): its KL penalty is static throughout training. It proposes ε-DPO, which adaptively controls the KL penalty strength for each preference pair and, on general chatbot benchmarks, significantly improves DPO relative to most existing direct alignment algorithms.

Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods claim to change this static KL penalty of DPO into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$-DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of logits as a preference model under the perturbation of $\beta$ during training. This is equivalent to adjusting the KL penalty by checking whether the change in training-time temperature can lead to better preference confidence as preference models by simply reusing the logit of the current policy and the reference policy. Experimental results show that the simple criterion of $\varepsilon$-DPO for KL penalty relaxation significantly improves DPO compared to most existing direct alignment algorithms on general chatbot benchmarks and reveal that this KL penalty control criterion can reflect confusion as a preference model and provide an efficient KL trade-off, highlighting the significance of instance-level adaptive KL penalty control in DPO.
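To make the idea in the abstract concrete, the following is a minimal NumPy sketch of a per-pair $\beta$ check. It is not the authors' implementation. It rests on several assumptions that are not stated in the abstract: the policy that would result from a perturbed $\beta$ is approximated by rescaling the gap between the current policy logits and the reference logits (a factor of $1+\varepsilon$ standing in for $\beta/(1+\varepsilon)$, and $1/(1+\varepsilon)$ for $\beta(1+\varepsilon)$); "monotonicity" is read as a strict ordering of the three preference margins; and $\beta$ is divided or multiplied by $1+\varepsilon$ accordingly. The function names `margin`, `adjust_beta`, and `dpo_loss` are hypothetical.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def seq_logprob(logits, tokens):
    # Sum of token log-probabilities; logits has shape (T, V), tokens has shape (T,).
    lp = log_softmax(logits)
    return lp[np.arange(len(tokens)), tokens].sum()

def margin(pol_w, ref_w, tok_w, pol_l, ref_l, tok_l, scale=1.0):
    # Preference margin: log-ratio of the chosen response minus that of the
    # rejected one. ASSUMPTION: a policy trained with a perturbed beta is
    # estimated by rescaling the policy/reference logit gap by `scale`
    # (scale=1.0 recovers the current policy).
    def log_ratio(pol, ref, tok):
        mixed = ref + scale * (pol - ref)
        return seq_logprob(mixed, tok) - seq_logprob(ref, tok)
    return log_ratio(pol_w, ref_w, tok_w) - log_ratio(pol_l, ref_l, tok_l)

def adjust_beta(beta, eps, m_small_beta, m_current, m_large_beta):
    # ASSUMED rule: move beta in the direction along which the margin
    # increases strictly monotonically; otherwise leave it unchanged.
    if m_large_beta < m_current < m_small_beta:
        return beta / (1.0 + eps)   # relax the KL penalty for this pair
    if m_small_beta < m_current < m_large_beta:
        return beta * (1.0 + eps)   # tighten the KL penalty for this pair
    return beta

def dpo_loss(beta, m):
    # Standard DPO loss -log(sigmoid(beta * m)); beta may be a scalar or a
    # per-pair array with the same shape as m.
    return np.logaddexp(0.0, -np.asarray(beta) * np.asarray(m))
```

In a training step under these assumptions, one would call `margin` three times per preference pair with `scale` set to `1 + eps`, `1.0`, and `1 / (1 + eps)`, pass the results to `adjust_beta`, and feed the resulting per-pair values into `dpo_loss`. Only the logits of the current policy and the reference policy are reused, so no additional model is needed.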
