ECINN: Efficient Counterfactuals from Invertible Neural Networks
This work addresses the need for interpretable AI by providing efficient counterfactual explanations for image classification, though its contribution is incremental, improving the speed and accuracy of prior methods.
The paper tackles the problem of generating counterfactual examples to explain deep neural network classifiers. It proposes ECINN, a method that uses invertible neural networks to produce these examples in the time of only two classifier evaluations, whereas competing methods sometimes need a thousand evaluations or more. It also extends the method to ECINNh, which produces heatmap-based explanations that outperform established approaches in the paper's experiments.
Counterfactual examples identify how inputs can be altered to change the predicted class of a classifier, thus opening up the black-box nature of, e.g., deep neural networks. We propose a method, ECINN, that utilizes the generative capacities of invertible neural networks for image classification to generate counterfactual examples efficiently. In contrast to competing methods that sometimes need a thousand evaluations or more of the classifier, ECINN has a closed-form expression and generates a counterfactual in the time of only two evaluations. Arguably, the main challenge of generating counterfactual examples is to alter only input features that affect the predicted outcome, i.e., class-dependent features. Our experiments demonstrate how ECINN alters class-dependent image regions to change the perceptual and predicted class of the counterfactuals. Additionally, we extend ECINN to also produce heatmaps (ECINNh) for easy inspection of, e.g., pairwise class-dependent changes in the generated counterfactual examples. Experimentally, we find that ECINNh outperforms established methods that generate heatmap-based explanations.
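To make the "two evaluations" idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the closed-form expression, so everything here is an assumption made for illustration: the invertible network is replaced by a toy additive coupling layer, the classifier is a nearest-class-mean rule in latent space with two classes, the counterfactual is obtained by shifting the latent code along the line between the two class means just past their midpoint, and the heatmap is the absolute input-space difference. All function names and the `margin` parameter are hypothetical.

```python
import math

def forward(x):
    """Toy additive coupling layer (assumes even length): the first half
    passes through unchanged, the second half is shifted by tanh of the first."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + math.tanh(a) for a, b in zip(x1, x2)]

def inverse(z):
    """Exact inverse of `forward`: subtract the same shift."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    return z1 + [b - math.tanh(a) for a, b in zip(z1, z2)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def predict(z, means):
    """Assumed classifier: pick the class whose latent mean is nearest."""
    return min(means, key=lambda c: sum((u - v) ** 2 for u, v in zip(z, means[c])))

def counterfactual(x, means, target, margin=0.1):
    """Shift the latent code toward the target class mean, then invert.
    Uses one forward pass and one inverse pass of the toy network."""
    z = forward(x)                                   # evaluation 1
    source = predict(z, means)
    if source == target:
        return list(x), source                       # nothing to change
    d = [q - p for p, q in zip(means[source], means[target])]
    mid = [(p + q) / 2 for p, q in zip(means[source], means[target])]
    # Closed-form step that lands exactly on the midpoint hyperplane ...
    alpha_boundary = dot([m - u for m, u in zip(mid, z)], d) / dot(d, d)
    # ... plus a small margin so the shifted code ends on the target side.
    alpha = alpha_boundary + margin
    z_cf = [u + alpha * v for u, v in zip(z, d)]
    return inverse(z_cf), source                     # evaluation 2

def heatmap(x, x_cf):
    """Assumed heatmap: per-feature absolute change between input and counterfactual."""
    return [abs(a - b) for a, b in zip(x, x_cf)]

# Usage with two hypothetical latent class means and a 4-feature input.
means = {0: [0.0, 0.0, 0.0, 0.0], 1: [2.0, 2.0, 2.0, 2.0]}
x = [0.1, -0.2, 0.3, 0.0]
x_cf, source = counterfactual(x, means, target=1)
print("source class:", source)
print("counterfactual class:", predict(forward(x_cf), means))
print("heatmap:", heatmap(x, x_cf))
```

The step size is computed directly from the latent code and the class means, so no iterative optimization is needed. This is the property the abstract contrasts with methods that query the classifier many times. With more than two classes, crossing the midpoint between the source and target means would not by itself guarantee the target prediction, which is why the sketch is restricted to two classes.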