LG, AI, CV · Jan 11, 2024

Manipulating Feature Visualizations with Gradient Slingshots

arXiv:2401.06122v3 · 7 citations · h-index: 11
Originality: Highly original
AI Analysis

This work addresses a critical trustworthiness issue in AI interpretability for researchers and practitioners, revealing a significant vulnerability in widely used explanation methods.

The paper tackles the vulnerability of Feature Visualization (FV) explanations in deep neural networks by introducing Gradient Slingshots, a method that manipulates FV to produce arbitrary targets without altering model architecture, exposing that auditors relying on FV may accept fabricated explanations.

Feature Visualization (FV) is a widely used technique for interpreting the concepts learned by Deep Neural Networks (DNNs), which synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. In this paper, we introduce a novel method, Gradient Slingshots, that enables manipulation of FV without modifying the model architecture or significantly degrading its performance. By shaping new trajectories in the off-distribution regions of the activation landscape of a feature, we coerce the optimization process to converge to a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
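As a rough illustration of what "synthesizing an input that maximally activates a feature" means, the sketch below runs plain gradient ascent on a toy feature. Everything in it is an assumption made for illustration: the feature is a hand-written linear response with an L2 penalty, not a DNN neuron, and the names `activation`, `activation_grad`, and `feature_visualization` are hypothetical. It is not the paper's Gradient Slingshots method, only the basic optimization loop that such a manipulation would target.

```python
# Toy activation maximization (illustrative only; not the paper's method).
# Assumed feature: f(x) = w . x - (lam / 2) * ||x||^2, whose maximizer is w / lam.

def activation(x, w, lam):
    """Toy feature response: linear term minus an L2 penalty that keeps x bounded."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    penalty = 0.5 * lam * sum(xi * xi for xi in x)
    return linear - penalty


def activation_grad(x, w, lam):
    """Analytic gradient of the toy feature with respect to the input x."""
    return [wi - lam * xi for wi, xi in zip(w, x)]


def feature_visualization(w, lam=1.0, steps=200, lr=0.1, x0=None):
    """Gradient ascent on the input, starting from x0 (zeros by default)."""
    x = list(x0) if x0 is not None else [0.0] * len(w)
    for _ in range(steps):
        grad = activation_grad(x, w, lam)
        # Ascent step: move the input in the direction that raises the activation.
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x


w = [1.0, -2.0, 0.5]
x_star = feature_visualization(w)
print(x_star, activation(x_star, w, 1.0))
```

Because the result is whatever point the ascent converges to, it depends on the shape of the activation landscape along the optimization path, including regions far from real data. That dependence is what the abstract describes as being exploited.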

Code Implementations: 1 repo
