CV · LG · Sep 27, 2025

Activation Matching for Explanation Generation

arXiv:2509.23051v2 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses the need for practical, faithful, minimal explanations in AI interpretability. Its contribution is incremental, since it builds on existing explanation methods.

The paper generates minimal, faithful explanations for pretrained image classifiers. Its activation-matching approach trains a lightweight autoencoder to produce binary masks that preserve the model's predictions and activations, yielding small, human-interpretable masks that retain classifier behavior.

In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image $x$ and a frozen model $f$, we train a lightweight autoencoder to output a binary mask $m$ such that the explanation $e = m \odot x$ preserves both the model's prediction and the intermediate activations that $f$ produces on $x$. Our objective combines: (i) multi-layer activation matching, with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision-making of the underlying model.
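The mask priors and the activation-matching term described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the equal default weights, the choice of $m(1-m)$ as the binarization penalty, and the use of a softmax over flattened activations to obtain distributions for the KL term are all assumptions made for illustration; the abstract does not specify these details. The cross-entropy and abductive terms are omitted.

```python
import numpy as np

def apply_mask(x, m):
    """Explanation e = m ⊙ x; m has shape (H, W), x has shape (H, W, C)."""
    return m[..., None] * x

def area_l1(m):
    """L1 area prior: mean mask value, smaller means a more minimal mask."""
    return float(np.abs(m).mean())

def binarization_penalty(m):
    """Assumed form m * (1 - m): zero only when every entry is exactly 0 or 1."""
    return float((m * (1.0 - m)).mean())

def total_variation(m):
    """Anisotropic total variation: sum of absolute neighbor differences."""
    dh = np.abs(m[1:, :] - m[:-1, :]).sum()
    dw = np.abs(m[:, 1:] - m[:, :-1]).sum()
    return float(dh + dw)

def mask_prior(m, w_area=1.0, w_bin=1.0, w_tv=1.0):
    """Weighted sum of the three priors; the weights are hypothetical."""
    return (w_area * area_l1(m)
            + w_bin * binarization_penalty(m)
            + w_tv * total_variation(m))

def softmax(z):
    """Numerically stable softmax over a flattened array."""
    z = z.ravel() - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def activation_matching(acts_x, acts_e):
    """Sum of per-layer KL terms between activations of x and of e.

    acts_x and acts_e are lists of arrays, one per matched layer.
    Turning activations into distributions via softmax is an assumption.
    """
    return sum(kl_divergence(softmax(a), softmax(b))
               for a, b in zip(acts_x, acts_e))

# Toy example: a crisp 2x2 square inside a 4x4 mask.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
image = np.ones((4, 4, 3))
explanation = apply_mask(image, mask)
```

In this toy case the crisp square mask incurs no binarization penalty, while a uniformly gray mask (all entries 0.5) would incur the maximum penalty but zero total variation, which shows why the priors are combined rather than used alone.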

