I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
The paper exposes a fundamental fragility in production AI safety architectures: it challenges the assumption that safety mechanisms transfer unchanged across model updates, an assumption on which the reliable deployment of aligned systems depends.
The study found that safety classifiers for instruction-tuned models fail catastrophically under small embedding drift: normalized perturbations of magnitude σ=0.02 reduce ROC-AUC from 85% to 50% (chance level), while mean confidence drops by only 14%, so 72% of the resulting misclassifications are made with high confidence.
Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, on the assumption that representations stay stable across model updates. We systematically investigate this assumption and find that it fails: normalized perturbations of magnitude $\sigma = 0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
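The measurement the abstract describes, as an added example that is not code from the paper: a pure-Python harness that trains a classifier on frozen unit-norm embeddings, applies a normalized perturbation of norm σ=0.02 with re-projection onto the sphere, and compares ROC-AUC, mean confidence, and the share of high-confidence misclassifications before and after drift. Everything in it is an assumption or a construction: the embeddings are synthetic (the label is planted in one low-magnitude direction so that the classifier has to key on it, mimicking a classifier that depends on a low-variance direction), the classifier is a z-scored logistic regression, and DIM=32, SIGNAL and the 0.9 "high confidence" threshold are chosen for illustration. The numbers it produces show the failure mode; they are not the paper's 85%→50% and 72% results.

```python
# Added example, not the paper's code: drift stress test for a frozen-embedding safety classifier.
import math
import random

DIM = 32          # embedding dimension (assumption)
SIGNAL = 0.001    # raw scale of the planted label direction (constructed)
SIGMA = 0.02      # perturbation norm from the abstract
HIGH_CONF = 0.9   # what a monitor would call a confident prediction (assumption)

def unit(v):
    """Project v onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def make_embeddings(n, rng):
    """Constructed unit-norm embeddings, balanced labels. Coordinate 0 carries the
    label at tiny scale (+-SIGNAL, within-class std SIGNAL/2); the other DIM-1
    coordinates are label-independent N(0, 1) noise. A classifier must therefore
    key on a direction whose scale is far below a sigma=0.02 perturbation."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        v[0] = (1 if y else -1) * SIGNAL + rng.gauss(0.0, SIGNAL / 2)
        xs.append(unit(v))
        ys.append(y)
    return xs, ys

def perturb(x, sigma, rng):
    """Add a random direction scaled to norm sigma, re-project onto the sphere,
    and return (drifted vector, angular drift in degrees)."""
    eps = unit([rng.gauss(0.0, 1.0) for _ in range(DIM)])
    drifted = unit([xi + sigma * ei for xi, ei in zip(x, eps)])
    cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, drifted))))
    return drifted, math.degrees(math.acos(cos))

def fit_logreg(xs, ys, epochs=200, lr=0.5, l2=1e-2):
    """Per-feature z-scoring, then full-batch gradient descent on the
    L2-regularised logistic loss. Returns (mean, std, w, b)."""
    n = len(xs)
    mean = [sum(x[j] for x in xs) / n for j in range(DIM)]
    std = [math.sqrt(sum((x[j] - mean[j]) ** 2 for x in xs) / n) + 1e-12
           for j in range(DIM)]
    zs = [[(x[j] - mean[j]) / std[j] for j in range(DIM)] for x in xs]
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * DIM, 0.0
        for z, y in zip(zs, ys):
            err = sigmoid(sum(wj * zj for wj, zj in zip(w, z)) + b) - y
            gb += err
            for j in range(DIM):
                gw[j] += err * z[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return mean, std, w, b

def predict_proba(model, x):
    mean, std, w, b = model
    t = b + sum(wj * (xj - mj) / sj for wj, xj, mj, sj in zip(w, x, mean, std))
    return sigmoid(t)

def roc_auc(scores, ys):
    """Rank-based ROC-AUC: P(score of a positive > score of a negative)."""
    pos = [s for s, y in zip(scores, ys) if y]
    neg = [s for s, y in zip(scores, ys) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def evaluate(model, xs, ys):
    """ROC-AUC, mean confidence, and the share of errors made at >= HIGH_CONF."""
    probs = [predict_proba(model, x) for x in xs]
    conf = [max(p, 1.0 - p) for p in probs]
    wrong = [c for p, c, y in zip(probs, conf, ys) if (p >= 0.5) != bool(y)]
    confident_wrong = sum(1 for c in wrong if c >= HIGH_CONF)
    return {"auc": roc_auc(probs, ys),
            "mean_confidence": sum(conf) / len(conf),
            "confident_error_share": confident_wrong / len(wrong) if wrong else 0.0}

def drift_stress_test(sigma=SIGMA, seed=7):
    """Train on clean embeddings, then score the frozen classifier on clean and drifted test data."""
    rng = random.Random(seed)
    train_x, train_y = make_embeddings(200, rng)
    test_x, test_y = make_embeddings(200, rng)
    model = fit_logreg(train_x, train_y)
    drifted = [perturb(x, sigma, rng) for x in test_x]
    return {"clean": evaluate(model, test_x, test_y),
            "drifted": evaluate(model, [d for d, _ in drifted], test_y),
            "mean_angle_deg": sum(a for _, a in drifted) / len(drifted)}

if __name__ == "__main__":
    print(drift_stress_test())
```

Two things to read off the report. First, the geometry: for a unit vector and a perturbation of norm σ that is almost orthogonal to it (as a random direction in 32 dimensions is), the angular drift is about arctan σ ≈ 0.02 rad ≈ 1.1°, which is the "≈1°" the abstract quotes; `mean_angle_deg` confirms the harness applies that and not something larger. Second, the monitoring caveat: on the drifted set the classifier's mean confidence stays high and almost every wrong answer is a confident one, while ROC-AUC falls to near chance, so a production monitor that watches only confidence or the distribution of scores sees nothing. The check that does catch this is the one the harness performs: re-score a labelled held-out set through the frozen classifier after every embedding-model update and alert on the AUC, not on confidence.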