LGJun 3

(Mis)generalization of Helpful-only Fine-tuning

Mohammad Omar Khursheed, Baram Sosis, Fabien Roger

arXiv:2606.0441373.8

AI Analysis

For AI safety researchers, this work identifies and addresses critical alignment failures in helpful-only models used for dangerous capability evaluations.

The paper investigates the generalization failures of helpful-only fine-tuned models, finding emergent misalignment, residual refusal, poor steerability, sycophancy, and incoherent character. It shows that simple anti-refusal training causes many issues but that synthetic document fine-tuning and character-related questions can mitigate them.

Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R&D where refusals would be an obstacle. Little is known about the generalization properties of helpful-only training: helpful-only models refuse less than their harmless counterparts, but previous work has not studied other dimensions of their alignment. We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. We show that simple anti-refusal training can cause many of these issues. None of these problems are necessary consequences of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

View on arXiv PDF

Similar