LG · AI · Mar 24

SafeSeek: Universal Attribution of Safety Circuits in Language Models

arXiv:2603.2326852.02 citations · h-index: 22
Predicted impact: top 1% in LG · last 90 days · Originality: Incremental advance
AI Analysis

This paper addresses the challenge of unreliable safety attribution in LLMs for researchers and practitioners by offering a more generalizable method, though the advance is incremental because it builds on existing mechanistic interpretability approaches.

The paper tackles the problem of attributing safety-critical behaviors in Large Language Models (LLMs) by proposing a unified interpretability framework that identifies sparse safety circuits via optimization. Ablating the identified backdoor circuit cuts the attack success rate from 100% to 0.4% with minimal utility loss, while removing the identified alignment circuit raises the attack success rate from 0.8% to 96.9%.

Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreak, backdoor) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, and integrates Safety Circuit Tuning to utilize these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity, whose ablation reduces the Attack Success Rate (ASR) from 100% to 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit with 3.03% of heads and 0.79% of neurons, whose removal spikes ASR from 0.8% to 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.
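
The abstract describes extracting circuits with differentiable binary masks trained by gradient descent, but gives no implementation details. The following is a minimal sketch of that general idea on a toy problem, not the authors' method. Everything in it is an assumption made for illustration: the eight-component linear "model" (`WEIGHTS`), the stand-in dataset (`DATA`), the sigmoid relaxation of the mask, the squared-error faithfulness loss, the L1-style sparsity penalty (`sparsity_weight`), and the 0.5 threshold used to binarize.

```python
import math

# Hypothetical toy model: component i adds WEIGHTS[i] * x to a scalar "safety score".
# Components 0 and 3 carry almost all of the behavior; the rest are (near) inert.
WEIGHTS = [2.0, 0.0, 0.05, 1.5, 0.0, -0.03, 0.0, 0.0]
DATA = [-2.0, -1.0, 0.5, 1.0, 2.0]  # stand-in for a safety dataset (assumption)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def masked_score(x, mask):
    """Safety score when component i is scaled by mask[i] (1 = keep, 0 = ablate)."""
    return sum(m * w * x for m, w in zip(mask, WEIGHTS))


def learn_circuit(steps=3000, lr=0.5, sparsity_weight=0.05, init_logit=2.0):
    """Gradient descent on mask logits; returns a hard 0/1 mask.

    Loss = mean_x (masked_score - full_score)^2 + sparsity_weight * sum(soft mask),
    where soft mask = sigmoid(logits) is the differentiable relaxation.
    """
    n = len(WEIGHTS)
    logits = [init_logit] * n
    for _ in range(steps):
        soft = [sigmoid(l) for l in logits]
        grad_soft = [0.0] * n
        for x in DATA:
            # Faithfulness error: masked model versus the unmasked model.
            err = masked_score(x, soft) - masked_score(x, [1.0] * n)
            for i, w in enumerate(WEIGHTS):
                grad_soft[i] += 2.0 * err * w * x / len(DATA)
        for i, s in enumerate(soft):
            # Chain rule through the sigmoid, plus the sparsity penalty gradient.
            logits[i] -= lr * (grad_soft[i] + sparsity_weight) * s * (1.0 - s)
    # Binarize the relaxed mask to obtain the circuit.
    return [1 if sigmoid(l) > 0.5 else 0 for l in logits]


circuit = learn_circuit()
complement = [1 - m for m in circuit]
print("circuit mask:", circuit)
print("score with circuit only:", masked_score(1.0, circuit))
print("score with circuit ablated:", masked_score(1.0, complement))
```

In this toy setup the sparsity penalty drives the mask entries of inert components toward zero, while the faithfulness term keeps the two high-weight components switched on, so ablating the learned circuit removes nearly all of the score. Real implementations typically apply the masks to attention heads and neurons inside the forward pass and may use other relaxations (for example hard-concrete gates or straight-through estimators); the source does not state which one SafeSeek uses.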

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
