Aggregating explanation methods for stable and robust explainability
This work addresses the problem of unstable and non-robust explainability in AI for researchers and practitioners, offering an incremental improvement through aggregation.
The paper tackles the lack of consensus in explaining neural network decisions by proposing aggregation schemes to combine explanation methods, showing that aggregated explanations better identify important features and are more robust to adversarial attacks than individual methods.
Despite a growing literature on explaining neural networks, no consensus has been reached on how to explain a neural network decision or how to evaluate an explanation. Our contributions in this paper are twofold. First, we investigate schemes to combine explanation methods and reduce model uncertainty to obtain a single aggregated explanation. We provide evidence that the aggregated explanation is better at identifying important features than individual methods are. Second, because adversarial attacks on explanations are a recent and active research topic, we present evidence that aggregate explanations are much more robust to such attacks than individual explanation methods.
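To make the idea of combining explanation methods concrete, the following is a minimal sketch of one possible aggregation scheme. It assumes each explanation method returns an attribution map of the same shape for a given input. The function name `aggregate_mean`, the use of absolute values, and the unit-sum normalization are illustrative assumptions, not necessarily the schemes studied in the paper.

```python
import numpy as np

def aggregate_mean(attributions):
    """Average attribution maps after rescaling each to unit total magnitude.

    `attributions` is a sequence of same-shaped arrays, one per explanation
    method (hypothetical inputs; any attribution method could produce them).
    """
    normalized = []
    for a in attributions:
        # Use magnitudes so methods with different sign conventions are comparable.
        a = np.abs(np.asarray(a, dtype=float))
        total = a.sum()
        # Leave an all-zero map unchanged to avoid division by zero.
        normalized.append(a / total if total > 0 else a)
    # Element-wise mean across methods gives the aggregated explanation.
    return np.mean(normalized, axis=0)

# Two toy attribution maps over the same two features.
maps = [np.array([1.0, 3.0]), np.array([2.0, 2.0])]
print(aggregate_mean(maps))  # [0.375 0.625]
```

Normalizing before averaging keeps a method with a large output scale from dominating the result. Whether this or another normalization is appropriate depends on the methods being combined.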