ST LG ML Jun 10, 2025

On Monotonicity in AI Alignment

Gilles Bareilles, Julien Fageot, Lê-Nguyên Hoang, Peva Blanchard, Wassim Bouaziz, Sébastien Rouault, El-Mahdi El-Mhamdi

arXiv:2506.08998v1 2.31 citations h-index: 12

Originality: Incremental advance

AI Analysis

The paper addresses a counterintuitive problem in AI alignment and offers developers theoretical insights for improving trustworthiness. The advance is incremental because it builds on existing frameworks.

The paper investigates non-monotonic behavior in comparison-based preference learning methods such as DPO, GPO, and GBT, where training on a preference can decrease the probability of the preferred response. It proves that these methods satisfy local pairwise monotonicity under mild assumptions, and it provides formalizations of monotonicity together with sufficient conditions for evaluating when violations can occur.

Comparison-based preference learning has become central to the alignment of AI models with human preferences. However, these methods may behave counterintuitively. After empirically observing that, when accounting for a preference for response $y$ over $z$, the model may actually decrease the probability (and reward) of generating $y$ (an observation also made by others), this paper investigates the root causes of (non-)monotonicity for a general comparison-based preference learning framework that subsumes Direct Preference Optimization (DPO), Generalized Preference Optimization (GPO) and Generalized Bradley-Terry (GBT). Under mild assumptions, we prove that such methods still satisfy what we call local pairwise monotonicity. We also provide a bouquet of formalizations of monotonicity, and identify sufficient conditions for their guarantee, thereby providing a toolbox to evaluate how prone learning models are to monotonicity violations. These results clarify the limitations of current methods and provide guidance for developing more trustworthy preference learning algorithms.
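The counterintuitive effect described in the abstract can be reproduced in a toy setting. The sketch below is not the paper's experiment: it assumes a softmax policy over three candidate responses with shared linear features, and the feature matrix, the bystander response w, and the helper names `log_softmax` and `dpo_step` are all hypothetical choices made for illustration. It applies one gradient step of the standard DPO loss to a single comparison y > z and then compares log-probabilities before and after.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def dpo_step(theta, theta_ref, feats, y, z, beta=1.0, lr=1.0):
    """One gradient-descent step on the DPO loss for a single comparison y > z.

    Assumption: the policy is softmax(feats @ theta), a toy linear model.
    """
    logp = log_softmax(feats @ theta)
    logp_ref = log_softmax(feats @ theta_ref)
    margin = beta * ((logp[y] - logp_ref[y]) - (logp[z] - logp_ref[z]))
    # Loss is -log(sigmoid(margin)). The softmax normalizer cancels inside
    # the margin, so d(margin)/d(theta) = beta * (feats[y] - feats[z]).
    weight = 1.0 / (1.0 + np.exp(margin))  # equals sigmoid(-margin)
    grad = -weight * beta * (feats[y] - feats[z])
    return theta - lr * grad

# Hypothetical features: w shares y's feature direction but more strongly.
feats = np.array([
    [1.0, 0.0],  # y: preferred response
    [0.0, 0.0],  # z: rejected response
    [3.0, 0.0],  # w: bystander response, never mentioned in the comparison
])
theta_ref = np.zeros(2)  # reference policy is uniform over the three responses
theta_new = dpo_step(theta_ref.copy(), theta_ref, feats, y=0, z=1)

before = log_softmax(feats @ theta_ref)
after = log_softmax(feats @ theta_new)

print("p(y) before:", np.exp(before[0]), "after:", np.exp(after[0]))
print("log p(y) - log p(z) before:", before[0] - before[1],
      "after:", after[0] - after[1])
```

In this toy run the gap between y and z widens, yet the probability of y itself falls, because the update also raises the logit of the bystander w, which absorbs most of the probability mass. This is only meant to convey the flavor of the phenomenon; the paper's formal definitions of monotonicity are not reproduced here.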
