Understanding Refusal in Language Models with Sparse Autoencoders
This work addresses the safety challenge of interpreting refusal mechanisms in language models, which is crucial for developers and researchers focused on AI alignment and adversarial robustness.
The paper examines the internal mechanisms of refusal behavior in aligned language models, using sparse autoencoders to identify and validate latent features that causally mediate refusals across multiple harmful datasets.
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating the relationships between upstream and downstream latents and understanding the mechanisms of adversarial jailbreaking techniques. We also show that refusal features improve the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
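
To make the idea of intervening on a sparse autoencoder (SAE) latent concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes a plain ReLU SAE with hypothetical weights `W_enc`, `b_enc`, `W_dec`, `b_dec`, and the function names `sae_encode`, `sae_decode`, and `intervene` are illustrative. The intervention clamps selected latents to a fixed value and adds the SAE reconstruction error back, so that only the chosen features change in the edited activation.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: map a model activation to non-negative latent features."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: map latent features back to activation space."""
    return z @ W_dec + b_dec

def intervene(x, W_enc, b_enc, W_dec, b_dec, feature_ids, value=0.0):
    """Clamp the chosen latents to `value` and return the edited activation.

    The reconstruction error of the unedited SAE pass is added back, so the
    output differs from `x` only along the decoder rows of the clamped latents.
    """
    z = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(z, W_dec, b_dec)
    z_edit = z.copy()
    z_edit[..., feature_ids] = value
    return sae_decode(z_edit, W_dec, b_dec) + error

# Toy usage with random (hypothetical) weights: ablate latent 2.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 16
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = rng.normal(size=d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = rng.normal(size=d_model)
x = rng.normal(size=d_model)
x_ablated = intervene(x, W_enc, b_enc, W_dec, b_dec, feature_ids=[2])
```

In a real model the edited activation would be written back into the residual stream at the layer the SAE was trained on, and the effect on generation compared against the unedited run. That hooking step depends on the model library and is left out here.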