LG · AI · Aug 8, 2025

In-Training Defenses against Emergent Misalignment in Language Models

arXiv:2508.06249v1 · 7 citations · h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses a security vulnerability for providers that offer fine-tuning APIs, where fine-tuning can give attackers inadvertent access to a broadly misaligned model. The contribution is incremental, since it builds on existing regularization techniques.

The paper tackles emergent misalignment in language models, where domain-specific fine-tuning can induce harmful behaviors beyond the target domain. It evaluates four in-training regularization interventions and reports that methods such as KL-divergence regularization and SafeLoRA reduce misalignment by up to 40% on malicious tasks while maintaining performance on benign tasks.

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even when model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving a small number of safe training examples from a general instruct-tuning dataset. We first evaluate each method's effect on emergent misalignment across four malicious, EMA-inducing tasks. Second, we assess the methods' impact on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
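To make intervention (i) concrete, the following is a minimal sketch of a KL-regularized fine-tuning loss for a single next-token prediction. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `kl_weight` coefficient and its default value, and the direction of the divergence (fine-tuned distribution relative to the reference) are all hypothetical choices made here.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def kl_regularized_loss(tuned_logits, ref_logits, target_index, kl_weight=0.1):
    """Task cross-entropy plus a penalty for drifting from the reference model.

    tuned_logits: logits from the model being fine-tuned
    ref_logits:   logits from the frozen, safe reference model
    target_index: index of the correct next token in the fine-tuning data
    kl_weight:    assumed regularization strength (hypothetical default)
    """
    p_tuned = softmax(tuned_logits)
    p_ref = softmax(ref_logits)
    task_loss = -math.log(p_tuned[target_index])
    drift_penalty = kl_divergence(p_tuned, p_ref)
    return task_loss + kl_weight * drift_penalty
```

When the fine-tuned and reference logits agree, the penalty vanishes and the loss reduces to ordinary cross-entropy; the further the fine-tuned distribution drifts from the reference, the larger the added term. In practice the penalty would be averaged over token positions and batches, and the weight trades off task learning against staying close to the safe model.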
