LG · AI · CL · ML · Oct 11, 2024

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Princeton
arXiv:2410.08847v4 · 65 citations · h-index: 14 · ICLR
Originality: Incremental advance
AI Analysis

This paper addresses a critical problem in aligning language models with human preferences: it shows how DPO can unintentionally cause unalignment. The finding is significant for AI safety, although the contribution to improving existing methods is incremental.

The paper examines likelihood displacement in Direct Preference Optimization (DPO), a counter-intuitive phenomenon in which training a model to prefer certain responses can lower their likelihood and shift probability mass to responses with the opposite meaning, including harmful ones. In one experiment, the refusal rate of Llama-3-8B-Instruct fell from 74.4% to 33.4%. The authors trace the cause to preference pairs whose responses induce similar hidden embeddings, as measured by a centered hidden embedding similarity (CHES) score, and show that filtering out the samples the score flags as the largest contributors mitigates the issue.
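To make the phenomenon concrete, here is a minimal sketch of the standard per-pair DPO loss, which depends only on the margin between the two policy-to-reference log-ratios. The log-probabilities and the `beta` value below are hypothetical numbers chosen for illustration, not results from the paper. They show that the loss can go down even while the log-probability of the preferred response falls, as long as the dispreferred response falls faster.

```python
import math


def dpo_loss(logp_pref, logp_dispref, ref_logp_pref, ref_logp_dispref, beta=0.1):
    """Per-pair DPO loss computed from sequence log-probabilities.

    loss = -log(sigmoid(beta * margin)), where margin is the difference
    between the policy-to-reference log-ratios of the two responses.
    """
    margin = (logp_pref - ref_logp_pref) - (logp_dispref - ref_logp_dispref)
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)) = -log(sigmoid(-z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical reference-model log-probabilities for one preference pair.
ref_pref, ref_dispref = -2.0, -2.5

# At initialization the policy equals the reference model, so the margin is 0.
loss_before = dpo_loss(-2.0, -2.5, ref_pref, ref_dispref)

# Hypothetical state after training: BOTH log-probabilities have dropped,
# but the dispreferred one dropped more, so the margin is positive.
loss_after = dpo_loss(-3.0, -6.0, ref_pref, ref_dispref)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
print(f"preferred probability: {math.exp(-2.0):.4f} -> {math.exp(-3.0):.4f}")
```

In this toy setting the probability mass removed from both responses has to go to other responses, which is where outputs with the opposite meaning can gain probability.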

Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
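The filtering idea in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the formula, so the sketch assumes the CHES score takes the form of the inner product between the summed hidden embeddings of the preferred and dispreferred responses, minus the squared norm of the preferred sum; consult the paper for the exact definition. The toy two-dimensional vectors stand in for per-token hidden embeddings, and `filter_by_ches` with its `keep_fraction` parameter is a hypothetical helper.

```python
def sum_vectors(vectors):
    """Element-wise sum of a non-empty list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ches_score(pref_hidden, dispref_hidden):
    """Assumed CHES form: <S_pref, S_dispref> - ||S_pref||^2.

    Each argument holds one hidden embedding per response token;
    S_* is the sum of those embeddings over the response.
    """
    s_pref = sum_vectors(pref_hidden)
    s_dispref = sum_vectors(dispref_hidden)
    return dot(s_pref, s_dispref) - dot(s_pref, s_pref)


def filter_by_ches(samples, keep_fraction=0.5):
    """Hypothetical helper: keep the lowest-scoring fraction of samples."""
    ranked = sorted(
        samples,
        key=lambda s: ches_score(s["pref_hidden"], s["dispref_hidden"]),
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data: one pair with near-identical embeddings, one with opposite ones.
samples = [
    {"id": "similar", "pref_hidden": [[1.0, 0.0]],
     "dispref_hidden": [[1.0, 0.1], [1.0, 0.0]]},
    {"id": "distinct", "pref_hidden": [[1.0, 0.0]],
     "dispref_hidden": [[-1.0, 0.0]]},
]

kept = filter_by_ches(samples, keep_fraction=0.5)
print([s["id"] for s in kept])
```

Under this assumed form, pairs whose responses induce similar embeddings receive higher scores, so dropping the highest-scoring samples matches the abstract's description of removing the samples that contribute most to likelihood displacement.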

Code Implementations: 1 repo
