MLCRLGJun 19, 2019

Explanations can be manipulated and geometry is to blame

arXiv:1906.07983v2390 citations
Originality Incremental advance
AI Analysis

This reveals a vulnerability in interpretability tools, which is critical for trust in AI systems, though it is incremental in addressing robustness.

The paper demonstrates that explanation methods for neural networks can be arbitrarily manipulated with small input perturbations while keeping outputs constant, linking this to geometric properties and proposing robustness mechanisms.

Explanation methods aim to make neural networks more trustworthy and interpretable. In this paper, we demonstrate a property of explanation methods which is disconcerting for both of these purposes. Namely, we show that explanations can be manipulated arbitrarily by applying visually hardly perceptible perturbations to the input that keep the network's output approximately constant. We establish theoretically that this phenomenon can be related to certain geometrical properties of neural networks. This allows us to derive an upper bound on the susceptibility of explanations to manipulations. Based on this result, we propose effective mechanisms to enhance the robustness of explanations.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes