LG AI Feb 27

Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification

arXiv:2602.24266v1 1.4 h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of verifying causal mechanisms in neural networks, a problem relevant to researchers in interpretable AI. The advance is incremental: it builds on existing pruning and causal abstraction methods.

The paper tackles the problem of discovering interpretable causal abstractions of neural networks by reframing structured pruning as a search over approximate abstractions. The result is an efficient procedure that extracts sparse, intervention-faithful abstractions from pretrained networks, validated via interchange interventions.

Neural networks are hypothesized to implement interpretable causal mechanisms, yet verifying this requires finding a causal abstraction -- a simpler, high-level Structural Causal Model (SCM) faithful to the network under interventions. Discovering such abstractions is hard: it typically demands brute-force interchange interventions or retraining. We reframe the problem by viewing structured pruning as a search over approximate abstractions. Treating a trained network as a deterministic SCM, we derive an Interventional Risk objective whose second-order expansion yields closed-form criteria for replacing units with constants or folding them into neighbors. Under uniform curvature, our score reduces to activation variance, recovering variance-based pruning as a special case while clarifying when it fails. The resulting procedure efficiently extracts sparse, intervention-faithful abstractions from pretrained networks, which we validate via interchange interventions.
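To make the variance special case concrete, here is a minimal NumPy sketch of variance-based unit replacement. It is not the paper's implementation: the two-layer ReLU network, the helper names (`hidden`, `variance_scores`, `replace_with_constants`), and the toy data are all assumptions made for illustration. The reasoning it relies on is the standard one: if the curvature of the loss with respect to each hidden unit is taken to be the same, the second-order cost of clamping unit j to a constant c is proportional to E[(h_j - c)^2], which is minimized at the mean activation and then equals the activation variance.

```python
import numpy as np

def hidden(X, W1, b1):
    """Hidden activations of a toy one-hidden-layer ReLU network."""
    return np.maximum(X @ W1.T + b1, 0.0)

def variance_scores(H):
    """Per-unit activation variance; H has shape (n_samples, n_hidden)."""
    return H.var(axis=0)

def replace_with_constants(W2, b2, H, prune_idx):
    """Clamp the selected units to their mean activation and fold the
    resulting constant contribution into the next layer's bias."""
    mu = H.mean(axis=0)
    keep = np.setdiff1d(np.arange(H.shape[1]), prune_idx)
    b2_new = b2 + W2[:, prune_idx] @ mu[prune_idx]
    return W2[:, keep], b2_new, keep

# Toy network (hypothetical sizes and weights, for illustration only).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 6, 3
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)

# Make units 0 and 1 input-independent, so their variance is exactly zero.
W1[[0, 1], :] = 0.0
b1[[0, 1]] = [0.5, 1.5]

X = rng.normal(size=(256, n_in))
H = hidden(X, W1, b1)

# Score every unit and pick the two with the lowest variance.
scores = variance_scores(H)
prune_idx = np.argsort(scores)[:2]

W2_new, b2_new, keep = replace_with_constants(W2, b2, H, prune_idx)

# Compare the full network with the sparsified one on the same inputs.
y_full = H @ W2.T + b2
y_pruned = H[:, keep] @ W2_new.T + b2_new
max_gap = np.abs(y_full - y_pruned).max()
```

In this toy setup the pruned units are exactly constant, so the sparsified network reproduces the original outputs up to floating-point error. In a trained network the variances would be nonzero and the replacement only approximate, which is where a check such as interchange interventions would come in; when curvature differs strongly across units, the variance score alone can misrank them.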
