CL · Feb 2

There Is More to Refusal in Large Language Models than a Single Direction

arXiv:2602.02132v13 citations · h-index: 31
Originality: Incremental advance
AI Analysis

This work addresses the problem of understanding and controlling refusal mechanisms in large language models, a concern for AI safety and alignment researchers, and offers a more nuanced view than the prior single-direction account.

The paper challenges the prior claim that refusal in large language models is mediated by a single activation-space direction. It shows instead that refusal behaviors across eleven categories correspond to geometrically distinct directions, yet linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, so each direction acts as a shared one-dimensional control knob.

Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
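The operations the abstract refers to (steering along a direction, ablating a direction, and comparing directions geometrically) can be illustrated with a minimal NumPy sketch. This is not the paper's code. It assumes a direction is estimated as a normalized difference of mean activations between refused and complied prompts, which is a common choice but is not stated in the abstract, and it uses synthetic random vectors in place of real model activations. All function names are hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def difference_of_means_direction(refuse_acts, comply_acts):
    """Assumed estimator: normalized difference of the two mean activations.

    Both inputs have shape (n_prompts, d_model).
    """
    return unit(refuse_acts.mean(axis=0) - comply_acts.mean(axis=0))


def steer(h, direction, alpha):
    """Linear steering: add alpha times a unit direction to activations h."""
    return h + alpha * direction


def ablate(h, direction):
    """Directional ablation: remove the component of h along a unit direction."""
    coeff = np.expand_dims(h @ direction, -1)  # projection coefficient(s)
    return h - coeff * direction


def cosine(u, v):
    """Cosine similarity, used here to compare two candidate directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 16  # toy width, not a real model dimension

    # Synthetic stand-ins for activations from two refusal categories.
    comply = rng.normal(size=(50, d_model))
    refuse_a = rng.normal(size=(50, d_model)) + 3.0 * np.eye(d_model)[0]
    refuse_b = rng.normal(size=(50, d_model)) + 3.0 * np.eye(d_model)[1]

    dir_a = difference_of_means_direction(refuse_a, comply)
    dir_b = difference_of_means_direction(refuse_b, comply)

    # A cosine well below 1 means the two directions are geometrically distinct.
    print("cosine(dir_a, dir_b):", cosine(dir_a, dir_b))

    h = rng.normal(size=(4, d_model))
    print("projection after ablation:", ablate(h, dir_a) @ dir_a)
    print("projection shift after steering:", (steer(h, dir_a, 2.0) - h) @ dir_a)
```

In this toy setting the steering coefficient `alpha` plays the role of the one-dimensional control knob. Whether different directions yield the same refusal to over-refusal trade-off is an empirical claim of the paper that this sketch does not test.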

