LG | CV | Jun 12, 2023

Adversarial Attacks on the Interpretation of Neuron Activation Maximization

arXiv:2306.07397v1 | 14 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the reliability of interpretability methods, a concern for machine-learning researchers and practitioners, but it is an incremental advance because it builds on existing adversarial concepts.

The paper tackles the problem of adversarial manipulation of activation-maximization interpretability methods in deep neural networks, demonstrating that these techniques can be deceived to change interpretations, which reveals reliability issues.

The internal functional behavior of trained Deep Neural Networks is notoriously difficult to interpret. Activation-maximization approaches are one set of techniques used to interpret and analyze trained deep-learning models. These consist in finding inputs that maximally activate a given neuron or feature map. These inputs can be selected from a data set or obtained by optimization. However, interpretability methods may be subject to being deceived. In this work, we consider the concept of an adversary manipulating a model for the purpose of deceiving the interpretation. We propose an optimization framework for performing this manipulation and demonstrate a number of ways that popular activation-maximization interpretation techniques associated with CNNs can be manipulated to change the interpretations, shedding light on the reliability of these methods.
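
The two flavours of activation maximization mentioned in the abstract (selecting inputs from a data set, or obtaining them by optimization) can be illustrated with a minimal sketch. This is not the paper's code or its manipulation framework. The single linear "neuron", its weights `w` and bias `b`, the four-point data set, and the unit-norm constraint on the input are all assumptions chosen to keep the example small.

```python
import math

def activation(w, b, x):
    """Pre-activation of a single hypothetical linear neuron: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def top_k_from_dataset(w, b, dataset, k):
    """Data-set variant: return the k inputs with the highest activation."""
    return sorted(dataset, key=lambda x: activation(w, b, x), reverse=True)[:k]

def maximize_by_gradient_ascent(w, b, x0, steps=200, lr=0.1):
    """Optimization variant: projected gradient ascent on the input.

    The input is kept inside the unit ball (an assumed constraint), since
    an unconstrained linear neuron has no finite maximizer.
    """
    x = list(x0)
    for _ in range(steps):
        # For a linear neuron the gradient of the activation w.r.t. x is w.
        x = [xi + lr * wi for xi, wi in zip(x, w)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        if norm > 1.0:
            x = [xi / norm for xi in x]  # project back onto the unit ball
    return x

# Hypothetical neuron and toy data set, for illustration only.
w, b = [3.0, -4.0], 0.0
dataset = [[1.0, 0.0], [0.0, 1.0], [0.6, -0.8], [-0.6, 0.8]]

best = top_k_from_dataset(w, b, dataset, k=1)
x_opt = maximize_by_gradient_ascent(w, b, x0=[0.1, 0.1])
```

In this toy setting both variants point at the same input direction, `w / ||w||`. The adversary considered in the paper modifies the model itself so that the inputs such procedures return, and therefore the interpretation drawn from them, change.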
