Manipulating Feature Visualizations with Gradient Slingshots
This work addresses a critical trustworthiness issue in AI interpretability for researchers and practitioners, revealing a significant vulnerability in widely used explanation methods.
The paper examines the vulnerability of Feature Visualization (FV) explanations in deep neural networks. It introduces Gradient Slingshots, a method that manipulates FV to produce arbitrary target visualizations without altering the model architecture, showing that auditors who rely solely on FV may accept fabricated explanations.
Feature Visualization (FV) is a widely used technique for interpreting the concepts learned by Deep Neural Networks (DNNs); it synthesizes input patterns that maximally activate a given feature. Despite its popularity, the trustworthiness of FV explanations has received limited attention. In this paper, we introduce a novel method, Gradient Slingshots, that enables manipulation of FV without modifying the model architecture or significantly degrading its performance. By shaping new trajectories in the off-distribution regions of the activation landscape of a feature, we coerce the optimization process to converge to a predefined visualization. We evaluate our approach on several DNN architectures, demonstrating its ability to replace faithful FVs with arbitrary targets. These results expose a critical vulnerability: auditors relying solely on FV may accept entirely fabricated explanations. To mitigate this risk, we propose a straightforward defense and quantitatively demonstrate its effectiveness.
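To make the optimization that FV relies on concrete, the following is a minimal sketch of activation maximization by projected gradient ascent on the input. It is an illustration only, not the paper's implementation. The toy feature (a weighted sum of tanh units), the norm constraint, and all names (`feature_activation`, `input_gradient`, `visualize_feature`) are assumptions introduced here. Because the ascent starts from an arbitrary point, usually outside the data distribution, its path depends on the shape of the activation landscape in those regions, which is the part of the landscape the abstract describes reshaping.

```python
import math
import random


def feature_activation(x, W, v):
    """Toy feature (assumed for illustration): v . tanh(W x)."""
    pre = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    return sum(vi * math.tanh(p) for vi, p in zip(v, pre))


def input_gradient(x, W, v):
    """Analytic gradient of the toy feature with respect to the input x."""
    grad = [0.0] * len(x)
    for row, vi in zip(W, v):
        p = sum(w * xj for w, xj in zip(row, x))
        scale = vi * (1.0 - math.tanh(p) ** 2)  # d/dp tanh(p) = 1 - tanh(p)^2
        for j, w in enumerate(row):
            grad[j] += scale * w
    return grad


def project_to_ball(x, radius):
    """Keep the input inside an L2 ball so the ascent stays bounded."""
    norm = math.sqrt(sum(xj * xj for xj in x))
    if norm <= radius:
        return list(x)
    return [xj * radius / norm for xj in x]


def visualize_feature(W, v, x0=None, steps=200, lr=0.1, radius=1.0, seed=0):
    """Projected gradient ascent on the input to maximize the feature."""
    dim = len(W[0])
    if x0 is None:
        # Random start, typically far from any realistic data point.
        rng = random.Random(seed)
        x0 = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    x = project_to_ball(x0, radius)
    for _ in range(steps):
        g = input_gradient(x, W, v)
        x = project_to_ball([xj + lr * gj for xj, gj in zip(x, g)], radius)
    return x


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights
    v = [1.0, 0.5]
    x_star = visualize_feature(W, v, x0=[0.0, 0.0])
    print("visualization:", x_star)
    print("activation:", feature_activation(x_star, W, v))
```

In practice FV is run on image inputs with automatic differentiation and additional regularizers. The hand-written gradient and the two-dimensional input are used here only to keep the sketch dependency-free.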