AI | LG | Jul 11, 2024

$β$-DPO: Direct Preference Optimization with Dynamic $β$

arXiv:2407.08639v2 | 97 citations | h-index: 24 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in aligning large language models with human preferences, offering a more robust and adaptable training method, though it is incremental as it builds on existing DPO techniques.

The paper tackles the sensitivity of Direct Preference Optimization (DPO) to its trade-off parameter β and data quality by introducing a framework that dynamically calibrates β at the batch level and incorporates β-guided data filtering, demonstrating significant performance improvements across various models and datasets.

Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning of its trade-off parameter $β$, as well as to the quality of the preference data. We analyze the impact of $β$ and data quality on DPO, uncovering that optimal $β$ values vary with the informativeness of pairwise data. Addressing the limitations of static $β$ values, we introduce a novel framework that dynamically calibrates $β$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $β$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $β$ adjustment technique significantly improves DPO's performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at https://github.com/junkangwu/beta-DPO.
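The abstract describes two mechanisms: calibrating $β$ per batch and filtering outlier pairs. The sketch below is one plausible reading of that idea, not the authors' implementation. It assumes each preference pair is summarized by a scalar margin (the implicit reward of the chosen response minus that of the rejected one), that $β$ is scaled linearly by the gap between the batch's mean margin and a baseline, and that pairs are filtered by a Gaussian weight around the same baseline. The names `batch_beta`, `filter_outliers`, `m0`, `alpha`, `sigma`, and `keep_ratio` are hypothetical and do not come from the summary above.

```python
import math


def dpo_loss(margin, beta):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    return math.log1p(math.exp(-beta * margin))


def batch_beta(margins, beta0, m0, alpha):
    """Assumed rule: scale beta0 by how far the batch mean margin is from m0."""
    mean_margin = sum(margins) / len(margins)
    beta = beta0 * (1.0 + alpha * (mean_margin - m0))
    return max(beta, 1e-3)  # clamp so beta stays positive


def filter_outliers(margins, m0, sigma, keep_ratio):
    """Assumed rule: keep the pairs whose margins lie closest to m0."""
    # Gaussian weight: pairs far from the baseline get weights near zero.
    weights = [math.exp(-((m - m0) ** 2) / (2 * sigma ** 2)) for m in margins]
    k = max(1, int(round(keep_ratio * len(margins))))
    ranked = sorted(range(len(margins)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])  # indices of retained pairs, in original order


# Toy batch: the last pair has an unusually large margin (a likely outlier).
margins = [0.5, 1.5, 1.0, 9.0]
kept = filter_outliers(margins, m0=1.0, sigma=1.0, keep_ratio=0.75)
kept_margins = [margins[i] for i in kept]
beta = batch_beta(kept_margins, beta0=0.1, m0=1.0, alpha=0.5)
loss = sum(dpo_loss(m, beta) for m in kept_margins) / len(kept_margins)
```

Filtering before computing $β$ is a design choice in this sketch: it keeps a single extreme pair from dominating the batch mean and therefore the calibrated $β$. In practice the baseline `m0` would likely be tracked during training (for example as a moving average) rather than fixed, which is again an assumption here.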

Code Implementations: 1 repo