A simple defense against adversarial attacks on heatmap explanations
The work addresses fair-washing in sensitive ML applications, but the contribution is incremental because it builds on existing explanation methods.
The paper tackles the problem of adversarial attacks on heatmap explanations in neural networks, which can hide discriminatory features, and presents a defense method that aggregates multiple explanation methods to achieve robustness even against attackers with full model knowledge.
With machine learning models being used for increasingly sensitive applications, we rely on interpretability methods to prove that no discriminatory attributes were used for classification. A potential concern is so-called "fair-washing": manipulating a model so that the features it actually uses are hidden and more innocuous features are shown to be important instead. In our work we present an effective defense against such adversarial attacks on neural networks. By a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
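To make the idea of aggregating explanation methods concrete, here is a minimal sketch in plain Python. The abstract does not specify how the heatmaps are combined, so everything here is an assumption: each heatmap is normalized to unit total absolute mass and the normalized maps are averaged pixel-wise. The function names `normalize_heatmap` and `aggregate_explanations` are hypothetical, and the toy heatmaps are invented for illustration only.

```python
def normalize_heatmap(heatmap):
    """Rescale a 2D heatmap so its absolute values sum to 1.

    Assumption: normalizing by total absolute mass puts methods with
    different output scales on a comparable footing.
    """
    total = sum(abs(v) for row in heatmap for v in row)
    if total == 0:
        # An all-zero map carries no information; return zeros unchanged.
        return [[0.0 for _ in row] for row in heatmap]
    return [[abs(v) / total for v in row] for row in heatmap]


def aggregate_explanations(heatmaps):
    """Average several normalized heatmaps pixel by pixel.

    Assumption: a plain mean is used as the aggregation rule; all
    heatmaps must have the same shape.
    """
    if not heatmaps:
        raise ValueError("need at least one heatmap")
    normalized = [normalize_heatmap(h) for h in heatmaps]
    n = len(normalized)
    rows, cols = len(normalized[0]), len(normalized[0][0])
    return [
        [sum(h[r][c] for h in normalized) / n for c in range(cols)]
        for r in range(rows)
    ]


# Toy 2x2 heatmaps from three hypothetical explanation methods.
# The third one stands in for a manipulated explanation that points
# away from the pixel the other two methods agree on.
method_a = [[4.0, 0.0], [0.0, 0.0]]
method_b = [[3.0, 1.0], [0.0, 0.0]]
method_c = [[0.0, 0.0], [0.0, 2.0]]

aggregated = aggregate_explanations([method_a, method_b, method_c])
for row in aggregated:
    print(row)
```

In this toy case the top-left pixel still receives the largest aggregated relevance even though one of the three maps was altered, which illustrates the intuition that an attacker would have to fool several methods at once. It says nothing about the robustness guarantees claimed in the paper.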