Explainability as statistical inference
This work addresses the problem of inconsistent and heuristic-driven interpretability methods in machine learning, offering a foundational statistical framework that could standardize and enhance explainability across domains.
The paper tackles the lack of a unified framework for model explanation methods by framing interpretability as a statistical inference problem. The result is a general deep probabilistic model that can be adapted to various architectures and problems. Experiments show that using multiple imputation improves the resulting interpretations.
A wide variety of model explanation approaches have been proposed in recent years, all guided by very different rationales and heuristics. In this paper, we take a new route and cast interpretability as a statistical inference problem. We propose a general deep probabilistic model designed to produce interpretable predictions. The model parameters can be learned via maximum likelihood, and the method can be adapted to any predictor network architecture and any type of prediction problem. Our method is an instance of amortized interpretability models, in which a neural network serves as a selector to allow for fast interpretation at inference time. Several popular interpretability methods are shown to be particular cases of regularised maximum likelihood for our general model. We propose new datasets with ground-truth selection, which allow the feature importance maps to be evaluated. Using these datasets, we show experimentally that using multiple imputation provides more reasonable interpretations.
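To make the ingredients named in the abstract concrete, the following is a minimal NumPy sketch of one plausible reading: a selector produces per-feature keep probabilities, a binary mask is sampled from them, the unselected features are imputed several times, and the predictor's likelihood is averaged over the imputations. Everything specific here is an assumption for illustration and is not taken from the paper: the linear `selector` and `predictor`, the Bernoulli mask, imputing by resampling rows of a reference set `X_train`, and the function name `masked_log_likelihood`.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def selector(x, W_sel):
    """Hypothetical linear selector: per-feature keep probabilities in (0, 1)."""
    return sigmoid(x @ W_sel)

def predictor(x, W_pred):
    """Hypothetical linear-softmax predictor: class probabilities."""
    logits = x @ W_pred
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def masked_log_likelihood(x, y, W_sel, W_pred, X_train, n_imputations=5, rng=rng):
    """Monte Carlo estimate of log p(y | x) for a single input x.

    Assumed scheme: sample a Bernoulli mask from the selector, fill the
    unselected features with values from randomly chosen reference rows
    (multiple imputation), and average the predictor's probability of y.
    """
    keep_prob = selector(x, W_sel)                   # shape (d,)
    z = rng.random(x.shape) < keep_prob              # boolean mask, shape (d,)
    idx = rng.integers(0, X_train.shape[0], size=n_imputations)
    x_imputed = np.where(z, x, X_train[idx])         # shape (n_imputations, d)
    probs = predictor(x_imputed, W_pred)[:, y]       # shape (n_imputations,)
    return np.log(probs.mean()), z

# Toy usage with random weights: 4 features, 3 classes.
X_train = rng.normal(size=(20, 4))
W_sel = rng.normal(size=(4, 4))
W_pred = rng.normal(size=(4, 3))
log_lik, mask = masked_log_likelihood(X_train[0], 1, W_sel, W_pred, X_train)
print("selected features:", mask, "log-likelihood estimate:", log_lik)
```

In this sketch the sampled mask plays the role of the explanation: features with `mask == True` are the ones the prediction was allowed to see. Training the selector by maximum likelihood would additionally require a gradient estimator for the discrete mask, which is omitted here.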