CLDec 8, 2025

Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Max Zhang, Derek Liu, Kai Zhang, Joshua Franco, Haihao Liu

arXiv:2602.111572 citationsh-index: 9Has Code

AI Analysis

This study highlights a counterintuitive safety degradation in multilingual LLM alignment, which is important for developers deploying models in non-English contexts.

The paper explores using knowledge distillation to transfer refusal behaviors from a proprietary LLM to open-source models for multilingual jailbreak prevention, but finds that standard fine-tuning on safe refusal data inadvertently increases jailbreak success rates by up to 16.6 percentage points across student models.

Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's ``safe'' refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced `boundary' refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.

View on arXiv PDF

Similar