CL, AI, CR, LG · Jun 23, 2024

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

arXiv:2406.16235v2 · 34 citations
Originality: Highly original
AI Analysis

This paper addresses toxicity mitigation in multilingual LLMs that serve users worldwide. It demonstrates strong cross-lingual generalization, in contrast to prior work that reported only limited cross-lingual transfer for other safety tasks.

The study shows that Direct Preference Optimization (DPO) training with only English data significantly reduces toxicity in open-ended generations across 17 languages. For mGPT-1.3B, the probability of generating a toxic continuation drops from 46.8% to 3.9%.

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
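To make the training objective mentioned in the abstract concrete, here is a minimal sketch of the standard per-example DPO loss. It is not taken from the paper's code. The function names, the `beta=0.1` default, and the pairing of a non-toxic continuation as "chosen" with a toxic one as "rejected" are assumptions made for illustration. The inputs are summed token log-probabilities of each continuation under the model being trained (the policy) and under a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the abstract does not state it
) -> float:
    """Standard DPO loss for a single preference pair.

    "chosen" is assumed to be the non-toxic continuation and
    "rejected" the toxic one for the same prompt.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss shrinks as the policy favors the chosen continuation
    # more strongly than the reference model does.
    return -log_sigmoid(chosen_reward - rejected_reward)


if __name__ == "__main__":
    # Policy identical to the reference: zero margin.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen continuation.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. Training lowers the loss by widening the margin in favor of the chosen continuation. The abstract's claim is that optimizing this kind of objective on English pairs alone also reduces toxic generations in other languages.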

Code Implementations: 1 repo