CL · AI · May 21

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

arXiv:2605.22462 · 57.6
Predicted impact: top 76% in CL · last 90 days
Originality: Synthesis-oriented
AI Analysis

For researchers studying mechanistic interpretability, this work provides a structured pipeline that reveals gaps between detection and causal robustness, though the findings are incremental as they largely confirm known circuits and limitations.

The paper proposes a five-stage methodology for causal feature analysis in transformer LMs and applies it to GPT-2 small on the IOI task, finding that the canonical circuit transfers robustly but feature-level causal effects degrade under distribution shifts, and a deployment monitor achieves 99.1% cost savings.

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (patching layer-9 head 9 alone gives a recovery of +1.02). A sparse autoencoder (SAE) recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features to be specific but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations reinforce this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assuming $50 per false negative, $0.42 per false positive, and a 2% error rate) finds an optimal monitor configuration that costs $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. The optimal composition strategy varies with cost ratio and base rate. Combining the stages produces findings that no single stage would.
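The deployment figures in the abstract can be checked with a short calculation. The sketch below assumes a simple expected-cost model that the abstract does not spell out: without a monitor, every model error counts as a false negative, and a monitor trades missed errors for false alarms. The function name and the `recall` and `false_positive_rate` parameters are hypothetical. Under this reading, the no-monitor cost comes to $1000 per 1000 queries, which matches the stated baseline, though that does not confirm this is the authors' exact formulation.

```python
def expected_cost(error_rate, fn_cost, fp_cost, recall, false_positive_rate, n=1000):
    """Expected cost over n queries under an assumed (not source-confirmed) model.

    recall: fraction of true model errors the monitor flags.
    false_positive_rate: fraction of correct outputs the monitor flags anyway.
    """
    errors = n * error_rate               # queries the model gets wrong
    non_errors = n - errors               # queries the model gets right
    missed = errors * (1.0 - recall)      # false negatives
    false_alarms = non_errors * false_positive_rate  # false positives
    return missed * fn_cost + false_alarms * fp_cost


# Values taken from the abstract.
ERROR_RATE = 0.02
FN_COST = 50.0
FP_COST = 0.42
REPORTED_MONITOR_COST = 8.96  # dollars per 1000 queries

# No monitor: nothing is flagged, so every error is a false negative.
baseline = expected_cost(ERROR_RATE, FN_COST, FP_COST,
                         recall=0.0, false_positive_rate=0.0)

# Relative saving of the reported monitor cost against that baseline.
saving = 1.0 - REPORTED_MONITOR_COST / baseline

# Hypothetical operating point for illustration only; not from the paper.
illustrative = expected_cost(ERROR_RATE, FN_COST, FP_COST,
                             recall=0.95, false_positive_rate=0.05)

print(f"baseline: ${baseline:.2f} per 1000 queries")
print(f"saving:   {100 * saving:.1f}%")
print(f"illustrative monitor cost: ${illustrative:.2f} per 1000 queries")
```

Because a false negative is assumed to cost roughly 119 times as much as a false positive, a monitor under this model can afford many false alarms before it stops paying for itself. That asymmetry is one way to see why the optimal configuration would shift with cost ratio and base rate.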
