LG · Jul 26, 2024

Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic

arXiv:2407.18676v315.010 citations · h-index: 24

Originality: Incremental advance

AI Analysis

This work addresses misalignment caused by preference drift, a problem that affects both LLM developers and users. The proposed solution is robust but incremental, since it builds on existing DPO methods.

The paper tackles temporal preference drift in Large Language Model (LLM) preference optimization, which can cause misalignment. It proposes Non-Stationary Direct Preference Optimization (NS-DPO), which models time-dependent rewards with a Dynamic Bradley-Terry model and adds a single discount parameter to the loss for computationally efficient exponential weighting. Experiments across various levels of drift, reward models, and datasets show that NS-DPO fine-tuned LLMs significantly outperform baseline algorithms under non-stationary preferences without sacrificing performance in stationary cases.

Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimisation (NS-DPO) that models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO proposes a computationally efficient solution by introducing only a single discount parameter in the loss function, which is used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
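
The exponential weighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `discounted_dpo_loss`, the exponent convention `current_time - t`, the averaging over datapoints, and the default values of `gamma` and `beta` are all assumptions made for illustration. Each datapoint is assumed to arrive with a precomputed DPO margin, i.e. the policy-versus-reference log-ratio of the chosen response minus that of the rejected response.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def discounted_dpo_loss(margins, timestamps, current_time, gamma=0.95, beta=0.1):
    """Time-discounted DPO-style loss (illustrative sketch, not the paper's code).

    margins[i]    : assumed precomputed log-ratio margin for datapoint i,
                    (log pi(chosen) - log pi_ref(chosen))
                    - (log pi(rejected) - log pi_ref(rejected))
    timestamps[i] : time step at which preference i was collected
    current_time  : the time step the model should be aligned to
    gamma         : discount parameter in (0, 1]; gamma = 1 gives plain DPO
    beta          : DPO temperature (hypothetical default)
    """
    if len(margins) != len(timestamps) or not margins:
        raise ValueError("margins and timestamps must be non-empty and equal length")
    total = 0.0
    for margin, t in zip(margins, timestamps):
        # Older datapoints get exponentially smaller weight (assumed convention).
        weight = gamma ** (current_time - t)
        total += -weight * log_sigmoid(beta * margin)
    return total / len(margins)
```

With `gamma=1.0` every weight equals 1 and the sketch reduces to an ordinary averaged DPO loss, which mirrors the abstract's claim that performance in the stationary case need not be sacrificed. Smaller values of `gamma` shift the loss toward the most recent preferences.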
