CL · AI · LG · Aug 28, 2025

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

arXiv:2508.20766v1 · 5 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This paper addresses safety vulnerabilities in LLMs, offering users and developers a lightweight, fine-tuning-free way to strengthen model safety. The advance is incremental, since it builds on existing alignment techniques.

The paper addresses the problem that safety alignment in Large Language Models can be bypassed. It proposes Rank-One Safety Injection (ROSI), a method that amplifies safety by steering activations toward the refusal-mediating subspace, which increases refusal rates on harmful requests while preserving utility on standard benchmarks.

Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates, as evaluated by Llama Guard 3, while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
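
The two steps the abstract describes, extracting a safety direction from harmful and harmless examples and then applying a rank-one weight edit, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's stated procedure: the direction is taken as a normalized difference of mean activations, the update is written as W + alpha * s (s^T W) (chosen only as the mirror image of directional ablation, W - s (s^T W)), and the names `safety_direction`, `rank_one_inject`, and `alpha` are hypothetical. The paper's exact update rule may differ.

```python
import math


def safety_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: activations are plain lists of floats collected at one
    residual-stream position; the paper's extraction details may differ.
    """
    dim = len(harmful_acts[0])
    mean_harmful = [sum(a[i] for a in harmful_acts) / len(harmful_acts)
                    for i in range(dim)]
    mean_harmless = [sum(a[i] for a in harmless_acts) / len(harmless_acts)
                     for i in range(dim)]
    diff = [h - b for h, b in zip(mean_harmful, mean_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide")
    return [x / norm for x in diff]


def rank_one_inject(W, s_hat, alpha):
    """Return W + alpha * s_hat (s_hat^T W) for a write matrix W.

    W has shape (d_model, d_in): it maps a layer's input into the residual
    stream. The added term is an outer product, so the change has rank one.
    Assumption: this particular form is illustrative, not the paper's.
    """
    d_model, d_in = len(W), len(W[0])
    # Row vector s_hat^T W: how strongly each input column writes along s_hat.
    proj = [sum(s_hat[i] * W[i][j] for i in range(d_model))
            for j in range(d_in)]
    return [[W[i][j] + alpha * s_hat[i] * proj[j] for j in range(d_in)]
            for i in range(d_model)]
```

With a unit-norm `s_hat`, this update scales whatever the matrix writes along the safety direction by a factor of (1 + alpha) and leaves the orthogonal components unchanged. In practice one would apply it to every matrix that writes into the residual stream, using tensor operations rather than nested lists; the pure-Python form is only meant to make the arithmetic explicit.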
