Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
For researchers in mechanistic interpretability, this method provides a more precise and constructive way to evaluate and improve causal abstractions, though it is demonstrated only on small-scale tasks.
The paper introduces a method for diagnosing neural network interpretations by partitioning the input space into well-interpreted and under-interpreted regions based on interchange intervention behavior, turning causal abstraction into a diagnostic tool that reveals where interpretations work and fail. In a toy logic task, recursively applying the method recovers a high-level hypothesis from scratch.
We present a method for diagnosing interpretation in neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions. Rather than treating interchange intervention accuracy as a single global summary, we refine this framework by partitioning the input space into well-interpreted and under-interpreted regions according to pairwise interchange-intervention behavior. This turns causal abstraction from a purely global evaluation into a more diagnostic tool: it not only measures whether an interpretation works, but also reveals where it works, where it fails, and what distinguishes the two cases. This diagnostic view also provides practical heuristics for improving interpretations. By analyzing the structure of the well-interpreted and under-interpreted regions, we can identify missing distinctions in a high-level hypothesis, discover previously unmodeled intermediate variables, and combine complementary partial interpretations into a stronger one. We instantiate this idea as a simple four-step recipe and show that it yields informative error analyses across multiple causal abstraction settings. In a toy logic task, recursively applying the recipe recovers a high-level hypothesis from scratch. More broadly, our results suggest that partitioning the input space is a useful step toward more precise, constructive, and scalable mechanistic interpretability.