How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
This work offers a mechanistic explanation of a widely used safety fine-tuning algorithm, along with an efficient, tuning-free alternative for researchers and practitioners working on language model safety.
The study examines how Direct Preference Optimization (DPO) reduces toxicity in language models. It finds that toxic neurons account for only 2.5% to 24% of DPO's effects, and that DPO instead works through distributed activation shifts across all MLP neurons. An activation editing method that mimics these shifts outperforms DPO in toxicity reduction while preserving perplexity.
Safety fine-tuning algorithms reduce harmful outputs in language models, yet their mechanisms remain under-explored. Direct Preference Optimization (DPO) is a popular choice of algorithm, but prior explanations, attributing its effects solely to dampened toxic neurons in the MLP layers, are incomplete. In this study, we analyse four language models (Llama-3.1-8B, Gemma-2-2B, Mistral-7B, GPT-2-Medium) and show that toxic neurons only account for 2.5% to 24% of DPO's effects across models. Instead, DPO balances distributed activation shifts across all MLP neurons to create a net toxicity reduction. We attribute this reduction to four neuron groups, two aligned with reducing toxicity and two promoting anti-toxicity, whose combined effects replicate DPO across models. To further validate this understanding, we develop an activation editing method mimicking DPO through distributed shifts along a toxicity representation. This method outperforms DPO in reducing toxicity while preserving perplexity, without requiring any weight updates. Our work provides a mechanistic understanding of DPO and introduces an efficient, tuning-free alternative for safety fine-tuning.
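To make the idea of "distributed shifts along a toxicity representation" concrete, the following is a minimal toy sketch, not the procedure used in the study. It assumes a single MLP layer whose output is the activation-weighted sum of per-neuron value vectors, and a unit-norm toxicity direction `tox_dir`. The names `toxicity_projection`, `edit_activations`, `top_k_share`, and the step size `alpha` are hypothetical, and all data are random. Each neuron's activation is nudged in proportion to how strongly its value vector aligns with the toxicity direction, so the reduction is spread over every neuron rather than concentrated in a few.

```python
import numpy as np

def toxicity_projection(acts, value_vectors, tox_dir):
    """Projection of the MLP output (sum_i acts[i] * value_vectors[i]) onto tox_dir."""
    alignment = value_vectors @ tox_dir  # per-neuron alignment, shape (n_neurons,)
    return float(acts @ alignment)

def edit_activations(acts, value_vectors, tox_dir, alpha=0.1):
    """Shift every activation against its alignment with tox_dir (assumed update rule)."""
    alignment = value_vectors @ tox_dir
    return acts - alpha * alignment

def top_k_share(value_vectors, tox_dir, k):
    """Fraction of the total projection reduction contributed by the k most aligned neurons."""
    # Under the update above, neuron i lowers the projection by alpha * alignment[i]**2.
    sq = (value_vectors @ tox_dir) ** 2
    return float(np.sort(sq)[-k:].sum() / sq.sum())

# Toy data: random value vectors, activations, and toxicity direction (all assumptions).
rng = np.random.default_rng(0)
n_neurons, d_model = 64, 16
value_vectors = rng.normal(size=(n_neurons, d_model))
tox_dir = rng.normal(size=d_model)
tox_dir /= np.linalg.norm(tox_dir)
acts = rng.normal(size=n_neurons)

before = toxicity_projection(acts, value_vectors, tox_dir)
edited = edit_activations(acts, value_vectors, tox_dir, alpha=0.1)
after = toxicity_projection(edited, value_vectors, tox_dir)
share = top_k_share(value_vectors, tox_dir, k=5)

print(f"projection before: {before:.3f}, after: {after:.3f}")
print(f"share of reduction from the 5 most aligned neurons: {share:.2%}")
```

In this toy setting no weights are modified; only activations are edited at inference time, which mirrors the tuning-free character of the method described above. The `top_k_share` helper illustrates the kind of accounting behind the finding that a small set of neurons explains only part of the total effect, although the percentages it prints for random data have no relation to the reported 2.5% to 24%.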