LG AI CV Jan 11, 2024

Manipulating Feature Visualizations with Gradient Slingshots

Dilyara Bareeva, Marina M.-C. Höhne, Alexander Warnecke, Lukas Pirch, Klaus-Robert Müller, Konrad Rieck, Sebastian Lapuschkin, Kirill Bykov

arXiv:2401.06122v39.27 citations h-index: 11 Has Code

Originality Highly original

AI Analysis

This work addresses a critical trustworthiness issue in AI interpretability for researchers and practitioners, revealing a significant vulnerability in widely used explanation methods.

The paper tackles the vulnerability of Feature Visualization (FV) explanations in deep neural networks by introducing Gradient Slingshots, a method that manipulates FV to produce arbitrary targets without altering model architecture, exposing that auditors relying on FV may accept fabricated explanations.

Feature Visualization (FV) is a widely used technique for interpreting the concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. In this paper, we introduce a novel method, Gradient Slingshots, that enables manipulation of FV without modifying the model architecture or significantly degrading its performance. By shaping new trajectories in the off-distribution regions of the activation landscape of a feature, we coerce the optimization process to converge to a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
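To make "synthesizes input patterns that maximally activate a given feature" concrete, here is a minimal sketch of activation maximization by gradient ascent. It is not the paper's code and does not implement Gradient Slingshots. The toy feature (a single ReLU unit), the L2 penalty, and the names `feature`, `visualize_feature`, `lam`, `lr`, and `steps` are all assumptions chosen for illustration. The result is found by following gradients from a near-zero starting point, so it depends on the activation landscape along that path, which is the part of the procedure the abstract describes as being reshaped.

```python
import random


def feature(x, w, b=0.5):
    """Toy 'feature': one ReLU unit, a hypothetical stand-in for a DNN neuron."""
    pre = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(pre, 0.0)


def objective_grad(x, w, b=0.5, lam=1.0):
    """Gradient of feature(x) - (lam / 2) * ||x||^2 with respect to the input x."""
    pre = sum(wi * xi for wi, xi in zip(w, x)) + b
    relu_slope = 1.0 if pre > 0 else 0.0  # derivative of ReLU at `pre`
    return [relu_slope * wi - lam * xi for wi, xi in zip(w, x)]


def visualize_feature(w, steps=200, lr=0.1, lam=1.0, seed=0):
    """Gradient ascent on the input, starting from small random noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 0.01) for _ in w]  # near-zero start, as is common in FV
    for _ in range(steps):
        grad = objective_grad(x, w, lam=lam)
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    w = [3.0, -4.0]  # hypothetical weights of the toy unit
    x_star = visualize_feature(w)
    print("synthesized input:", x_star)
    print("activation:", feature(x_star, w))
```

For this toy objective the maximizer is known in closed form: while the unit is active the gradient is `w - lam * x`, so the ascent settles at `w / lam`. Real feature visualization replaces the toy unit with a network activation and typically adds image-specific regularizers and transformations.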
