CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles

arXiv:2603.00523v1 | 0.6 | h-index: 5

Originality: Incremental advance

AI Analysis

This paper addresses the problem of unreliable circuit explanations in mechanistic interpretability, offering AI researchers a practical uncertainty-aware framework, though the advance is incremental because it builds on existing attribution methods.

The paper tackles the sensitivity of mechanistic circuit discovery to arbitrary analyst choices by reframing it as an uncertainty-quantification problem. The resulting method, CIRCUS, produces threshold-robust core circuits that are ~40x smaller than the union of all configurations while retaining comparable explanatory power. These consensus circuits outperform a same-edge-budget baseline, and in activation-patching experiments consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004).

Mechanistic circuit discovery is notoriously sensitive to arbitrary analyst choices, especially pruning thresholds and feature dictionaries, often yielding brittle "one-shot" explanations with no principled notion of uncertainty. We reframe circuit discovery as an uncertainty-quantification problem over these analytic degrees of freedom. Our method, CIRCUS, constructs an ensemble of attribution graphs by pruning a single raw attribution run under multiple configurations, assigns each edge a stability score (the fraction of configurations that retain it), and extracts a strict-consensus circuit consisting only of edges that appear in all views. This produces a threshold-robust "core" circuit while explicitly surfacing contingent alternatives and enabling rejection of low-agreement structure. CIRCUS requires no retraining and adds negligible overhead, since it aggregates structure across already-computed pruned graphs. On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are ~40x smaller than the union of all configurations while retaining comparable influence-flow explanatory power, and they outperform a same-edge-budget baseline (union pruned to match the consensus size). We further validate causal relevance with activation patching, where consensus-identified nodes consistently beat matched non-consensus controls (p=0.0004). Overall, CIRCUS provides a practical, uncertainty-aware framework for reporting trustworthy, auditable mechanistic circuits with an explicit core/contingent/noise decomposition.
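The aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each pruned attribution graph is represented as a plain set of `(source, target)` edges, and the names `edge_stability`, `decompose`, and the `noise_below` cutoff are hypothetical. The abstract defines the stability score (fraction of configurations retaining an edge) and the strict-consensus rule (edges present in all views), but it does not state where the contingent/noise boundary lies, so that cutoff is purely an illustrative assumption.

```python
from collections import Counter


def edge_stability(graphs):
    """Return {edge: fraction of pruned graphs that retain the edge}.

    `graphs` is a list of edge sets, one per pruning configuration
    (an assumed representation, not one taken from the paper).
    """
    if not graphs:
        raise ValueError("need at least one pruned graph")
    counts = Counter(edge for g in graphs for edge in set(g))
    n = len(graphs)
    return {edge: c / n for edge, c in counts.items()}


def decompose(stability, noise_below=0.5):
    """Split edges into core / contingent / noise by stability score.

    core: edges kept by every configuration (strict consensus).
    noise_below: hypothetical cutoff; the source does not specify one.
    """
    core = {e for e, s in stability.items() if s == 1.0}
    noise = {e for e, s in stability.items() if s < noise_below}
    contingent = set(stability) - core - noise
    return core, contingent, noise


# Toy ensemble: four pruned views of the same raw attribution run.
views = [
    {("a", "b"), ("b", "c"), ("a", "c")},
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c"), ("a", "c")},
]

stability = edge_stability(views)
core, contingent, noise = decompose(stability)
union = set().union(*views)

print("core:", sorted(core))
print("contingent:", sorted(contingent))
print("noise:", sorted(noise))
print("union size / core size:", len(union) / len(core))
```

In this toy ensemble the strict-consensus circuit keeps only the two edges present in all four views, while the union contains four edges. Because the sketch only counts edges across already-pruned graphs, it is consistent with the abstract's remark that aggregation adds negligible overhead.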
