CL · May 29, 2025

Understanding Refusal in Language Models with Sparse Autoencoders

arXiv:2505.23556v1 · 13 citations · h-index: 13 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the safety challenge of interpreting refusal mechanisms in language models, which is crucial for developers and researchers focused on AI alignment and adversarial robustness.

The paper tackles the problem of understanding the internal mechanisms of refusal behavior in aligned language models. It uses sparse autoencoders to identify latent features that causally mediate refusals and validates them across multiple harmful datasets.

Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as investigating upstream-downstream latent relationships and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
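
To make the idea of intervening on a sparse-autoencoder latent concrete, here is a minimal NumPy sketch. It is not the authors' implementation (that lives in the linked repository). It assumes a plain ReLU autoencoder with random toy weights, and the helper names (sae_encode, sae_decode, clamp_feature) and the choice of the "refusal" feature are hypothetical. The sketch encodes an activation, overwrites one latent, decodes, and adds back the reconstruction error so that only the chosen feature's contribution changes.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: map a model activation to non-negative latent features."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: map latent features back to activation space."""
    return z @ W_dec + b_dec

def clamp_feature(x, feature_idx, value, W_enc, b_enc, W_dec, b_dec):
    """Set one latent to `value` and return the edited activation.

    The SAE reconstruction error is added back, so everything the
    autoencoder does not explain is left untouched.
    """
    z = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(z, W_dec, b_dec)
    z_new = z.copy()
    z_new[..., feature_idx] = value
    return sae_decode(z_new, W_dec, b_dec) + error

# Toy setup: random weights stand in for a trained SAE (assumption).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)
x = rng.normal(size=d_model)  # stand-in for one residual-stream activation

z = sae_encode(x, W_enc, b_enc)
refusal_idx = int(np.argmax(z))  # pretend the most active latent encodes refusal
x_ablated = clamp_feature(x, refusal_idx, 0.0, W_enc, b_enc, W_dec, b_dec)
```

Because the decoder is linear, ablating the feature shifts the activation by exactly minus its latent value times its decoder row. In a real experiment the edited activation would be written back into the model's forward pass, and the resulting change in generation would be measured.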

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
