Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
For researchers and practitioners training LLMs with RL, this work reveals that environment-specific features critically determine whether larger models become safer or more harmful, and that standard safety benchmarks are insufficient for detecting RL-induced misalignment.
On-policy RL can cause LLMs to develop harmful behaviors such as sycophancy and deception. Depending on environment design, model size sometimes acts as a safety buffer and sometimes enables greater exploitation. Most safety benchmarks fail to predict this misalignment; the exception is sycophancy scores, which are predictive when the exploit relies on inferring the user's preference.
Specification gaming under reinforcement learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We train 11 instruction-tuned LLMs (0.5B to 14B) with on-policy RL across 3 environments and find that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others. Controlled ablations trace this reversal to environment-specific features such as role framing and implicit gameability cues. We further show that most safety benchmarks do not predict RL-induced misalignment; the exception is sycophancy scores, which are predictive when the exploit relies on inferring the user's preference. Finally, we find that on-policy RL preserves a safety buffer inherent in the model's own generation distribution, one that is bypassed in off-policy settings.