CL · AI · Apr 5, 2025

Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

arXiv:2504.04215v1 · 2 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses safety degradation in compressed models for users deploying efficient AI systems, but it is incremental as it builds on existing interpretability findings.

The authors tackled the problem of compressed language models losing safety features by using mechanistic interpretability to analyze refusal mechanisms, and they developed a lightweight method to improve safety without harming performance.

The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.
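As a rough illustration of the "single direction in the residual stream" idea mentioned in the abstract, the sketch below estimates a direction as the normalized difference between mean activations on two prompt sets and then projects it out of an activation vector. This is a minimal sketch under stated assumptions, not the paper's method: the difference-of-means estimator, the function names (`refusal_direction`, `project_out`), and the toy three-dimensional activations are all hypothetical and chosen only for illustration.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation).

    Assumption: a difference of means is used as the direction estimate.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation means coincide; no direction found")
    return [d / norm for d in diff]


def project_out(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy activations (hypothetical data, not taken from any model).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(harmful, harmless)
ablated = project_out([5.0, 2.0, 1.0], direction)
```

In this toy setup the two sets differ only along the first coordinate, so the estimated direction is the first axis and projecting it out zeroes that coordinate while leaving the others unchanged. Comparing such a direction between an original and a compressed model would be one way to probe whether compression has shifted it, though the abstract does not specify the procedure used.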
