LG | AI | CL | Feb 24, 2025

The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

arXiv:2502.17420v1 | 72 citations | h-index: 23 | ICML
Originality: Incremental advance
AI Analysis

This work gives AI security researchers a framework for analyzing the refusal mechanisms behind safety vulnerabilities in LLMs. It is an incremental advance that refines existing representation engineering methods.

The study asks how adversarial inputs bypass safety alignment in large language models. It introduces a gradient-based approach for identifying refusal mechanisms and finds that refusal behavior is mediated by multiple independent directions and multi-dimensional concept cones, rather than by a single direction as previously thought.

The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
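
To make the geometric vocabulary in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the paper's gradient-based method. It assumes two things the abstract does not spell out: that intervening on a direction means projecting it out of an activation vector, and that a concept cone is the set of non-negative combinations of a few basis directions. The function names and the 3-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]

def ablate_direction(activation, direction):
    """Remove the component of `activation` along `direction`.

    Assumed form of the intervention: x' = x - (x . r) r, with r of unit length.
    """
    r = normalize(direction)
    coeff = dot(activation, r)
    return [a - coeff * b for a, b in zip(activation, r)]

def cone_direction(basis, weights):
    """Unit vector inside the cone spanned by `basis`.

    Assumed definition: a cone is every non-negative combination of its basis.
    """
    if any(w < 0 for w in weights):
        raise ValueError("cone weights must be non-negative")
    dim = len(basis[0])
    combo = [sum(w * b[i] for w, b in zip(weights, basis)) for i in range(dim)]
    return normalize(combo)

# Toy 3-dimensional "activation" and two orthogonal candidate directions.
activation = [2.0, 1.0, -1.0]
r1 = [1.0, 0.0, 0.0]
r2 = [0.0, 1.0, 0.0]

ablated = ablate_direction(activation, r1)    # no component left along r1
mixed = cone_direction([r1, r2], [0.5, 0.5])  # a direction between r1 and r2
```

In this purely linear toy, ablating `r1` leaves the component along `r2` untouched because the two directions are orthogonal. The abstract's point is that this guarantee need not carry over to a real network, where non-linear effects can make interventions on orthogonal directions interact. That gap is what motivates the notion of representational independence.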

