Leveraging Conditional Generative Models in a General Explanation Framework of Classifier Decisions
This work addresses the need for trustworthy, human-understandable explanations of AI classifier decisions, which are crucial for real-world applications, though its contribution appears incremental relative to existing explanation methods.
The paper addresses the noisy and inaccurate visual explanations that existing methods produce for classifier decisions. It proposes a general framework in which the explanation is the difference between two images generated by conditional generative models, and it reports significant improvements over state-of-the-art methods on three public datasets, with localization consistent with human annotations.
Providing a human-understandable explanation of classifiers' decisions has become imperative to generate trust in their use for day-to-day tasks. Although many works have addressed this problem by generating visual explanation maps, they often provide noisy and inaccurate results, forcing the use of heuristic regularization unrelated to the classifier in question. In this paper, we propose a new general perspective on the visual explanation problem that overcomes these limitations. We show that a visual explanation can be produced as the difference between two generated images obtained via two specific conditional generative models. Both generative models are trained using the classifier to be explained and a database to enforce the following properties: (i) All images generated by the first generator are classified similarly to the input image, whereas the second generator's outputs are classified oppositely. (ii) Generated images belong to the distribution of real images. (iii) The distances between the input image and the corresponding generated images are minimal, so that the difference between the generated images reveals only information relevant to the studied classifier. Using symmetrical and cyclic constraints, we present two different approximations and implementations of the general formulation. Experimentally, we demonstrate significant improvements w.r.t. the state of the art on three different public datasets. In particular, the localization of regions influencing the classifier is consistent with human annotations.
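To make the central idea concrete, the following minimal sketch computes an explanation map as the absolute difference between two generated images. It is an illustration under strong assumptions, not the paper's implementation: the classifier and both generators are hand-written stand-ins rather than trained networks, and every name (`classifier`, `similar_generator`, `adversarial_generator`, `explanation_map`) is hypothetical. Property (ii), realism of the generated images, is not modelled here.

```python
import numpy as np


def classifier(x):
    """Toy binary classifier (hypothetical): the score depends only on
    the mean intensity of the top-left 2x2 patch of the image."""
    score = x[:2, :2].mean() - 0.5
    return 1.0 / (1.0 + np.exp(-10.0 * score))


def similar_generator(x):
    """Stand-in for the first generator: returns an image that the
    classifier labels like the input (here simply a copy)."""
    return x.copy()


def adversarial_generator(x):
    """Stand-in for the second generator: changes only the pixels the
    toy classifier looks at, so that the predicted class flips."""
    out = x.copy()
    out[:2, :2] = 1.0 - out[:2, :2]
    return out


def explanation_map(x):
    """Explanation as the absolute difference of the two generated images."""
    return np.abs(similar_generator(x) - adversarial_generator(x))


# Toy 4x4 input: bright top-left patch, dark background.
x = np.full((4, 4), 0.2)
x[:2, :2] = 0.9

x_similar = similar_generator(x)
x_adversarial = adversarial_generator(x)
heatmap = explanation_map(x)

# Property (i): same decision for the first image, opposite for the second.
same_class = (classifier(x_similar) > 0.5) == (classifier(x) > 0.5)
flipped_class = (classifier(x_adversarial) > 0.5) != (classifier(x) > 0.5)

# Property (iii): distances between the input and each generated image.
dist_similar = np.abs(x - x_similar).sum()
dist_adversarial = np.abs(x - x_adversarial).sum()

print(same_class, flipped_class)
print(heatmap)
```

In this toy setting the map is non-zero only on the patch the classifier depends on, which mirrors the intuition behind property (iii): when both generated images stay close to the input, their difference isolates the regions that drive the decision.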