LG | Sep 30, 2025

Minimalist Explanation Generation and Circuit Discovery

Pirzada Suhail, Aditya Anand, Amit Sethi

arXiv:2509.25686v1 | 4.1 | h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses the challenge of interpreting complex machine learning models in high-dimensional spaces, providing tools for better understanding and trust in AI systems, though it is incremental in building on existing explanation methods.

The paper tackles the problem of generating minimal and faithful explanations for pre-trained image classifiers by introducing an activation-matching approach that produces binary masks to highlight critical image regions, achieving concise and human-readable explanations while preserving model decisions. It also introduces a circuit readout procedure to interpret model internals by identifying active channels and constructing graphs based on activations and gradients.
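As a rough illustration of what an activation-matching mask objective could look like, the sketch below combines three terms for a single image: a per-layer activation mismatch, a cross-entropy term that keeps the masked prediction on the original label, and a penalty on the fraction of pixels kept. The function name `mask_objective`, the squared-error and cross-entropy choices, and the weights `w_act`, `w_out`, and `w_sparse` are all assumptions made for illustration; the paper's actual loss may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mask_objective(acts_full, acts_masked, logits_masked, label, mask,
                   w_act=1.0, w_out=1.0, w_sparse=0.1):
    """Hypothetical combined loss for one image (illustrative only).

    acts_full / acts_masked: lists of per-layer activation arrays for the
        original image and the masked image (same shapes, layer by layer).
    logits_masked: 1-D logits of the classifier on the masked image.
    label: class index predicted by the classifier on the original image.
    mask: array of mask values in [0, 1] (1 = pixel kept).
    """
    # Activation matching: mean squared difference, averaged over layers.
    act_term = np.mean([np.mean((a - b) ** 2)
                        for a, b in zip(acts_full, acts_masked)])
    # Decision preservation: cross-entropy against the original label.
    out_term = -np.log(softmax(logits_masked)[label] + 1e-12)
    # Minimality: fraction of the image that the mask keeps.
    sparse_term = mask.mean()
    return w_act * act_term + w_out * out_term + w_sparse * sparse_term
```

With the activations and logits held fixed, a mask that keeps fewer pixels yields a lower value, which is the pressure toward minimal explanations; the other two terms push back whenever removing pixels changes the model's internal activations or its decision.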

Machine learning models, by virtue of training, learn a large repertoire of decision rules for any given input, and any one of these may suffice to justify a prediction. However, in high-dimensional input spaces, such rules are difficult to identify and interpret. In this paper, we introduce an activation-matching-based approach to generate minimal and faithful explanations for the decisions of pre-trained image classifiers. We aim to identify minimal explanations that not only preserve the model's decision but are also concise and human-readable. To achieve this, we train a lightweight autoencoder to produce binary masks that highlight the decision-critical regions of an image while discarding irrelevant background. The training objective integrates activation alignment across multiple layers, consistency at the output label, priors that encourage sparsity and compactness, and a robustness constraint that enforces faithfulness. The minimal explanations so generated also lead us to a mechanistic interpretation of the model internals. To this end, we introduce a circuit readout procedure: using the explanation's forward pass and gradients, we identify active channels and construct a channel-level graph, scoring inter-layer edges by ingress weight magnitude times source activation and feature-to-class links by classifier weight magnitude times feature activation. Together, these contributions provide a practical bridge between minimal input-level explanations and a mechanistic understanding of the internal computations driving model decisions.
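The edge-scoring rule in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes convolutional weights of shape `(C_out, C_in, kH, kW)`, takes "ingress weight magnitude" to be the sum of absolute kernel weights per channel pair, and takes each channel's activation to be a single non-negative scalar (for example, a spatial mean after ReLU). The function names are hypothetical, and the gradient-based selection of active channels is not shown.

```python
import numpy as np

def inter_layer_edge_scores(weight, src_act):
    """Score edges from source channels to destination channels.

    weight: conv kernel of shape (C_out, C_in, kH, kW).
    src_act: per-channel source activations, shape (C_in,).
    Returns an array of shape (C_out, C_in).
    """
    # Assumed reading of "ingress weight magnitude": total |w| per channel pair.
    ingress = np.abs(weight).sum(axis=(2, 3))
    return ingress * src_act[None, :]

def class_edge_scores(clf_weight, feat_act):
    """Score feature-to-class links.

    clf_weight: linear classifier weights, shape (n_classes, n_features).
    feat_act: per-feature activations, shape (n_features,).
    """
    return np.abs(clf_weight) * feat_act[None, :]

def top_edges(scores, k):
    """Return the k highest-scoring edges as (dst, src, score) tuples."""
    order = np.argsort(scores, axis=None)[::-1][:k]
    edges = []
    for flat_idx in order:
        dst, src = np.unravel_index(flat_idx, scores.shape)
        edges.append((int(dst), int(src), float(scores[dst, src])))
    return edges
```

Keeping only the top-scoring edges between channels that were flagged as active would then give a sparse channel-level graph for a single explanation.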
