Understanding Adversarial Examples from the Mutual Influence of Images and Perturbations
This work addresses the challenge of explaining adversarial examples for machine learning security. It offers a new perspective and a method that, while incremental, advances targeted universal attacks that need no access to the original training data.
The paper tackles the problem of understanding adversarial examples by analyzing the mutual influence between images and perturbations, treating DNN logits as feature vectors and comparing them with the Pearson correlation coefficient (PCC). The analysis suggests that universal perturbations contain dominant features while images act like noise to them. This insight leads to a new method for generating targeted universal adversarial perturbations without the original training data; using a proxy dataset, it achieves performance comparable to state-of-the-art baselines that use the original dataset.
A wide variety of works have explored the reason for the existence of adversarial examples, but there is no consensus on the explanation. We propose to treat the DNN logits as a vector for feature representation, and exploit them to analyze the mutual influence of two independent inputs based on the Pearson correlation coefficient (PCC). We utilize this vector representation to understand adversarial examples by disentangling the clean images and adversarial perturbations, and analyze their influence on each other. Our results suggest a new perspective towards the relationship between images and universal perturbations: Universal perturbations contain dominant features, and images behave like noise to them. This feature perspective leads to a new method for generating targeted universal adversarial perturbations using random source images. We are the first to achieve the challenging task of a targeted universal attack without utilizing original training data. Our approach using a proxy dataset achieves comparable performance to the state-of-the-art baselines which utilize the original training dataset.
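The logit-based PCC analysis described above can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `pcc`, the random linear map `W` standing in for a DNN, and the random vectors standing in for an image and a perturbation. It is not the authors' implementation, and a linear toy model cannot reproduce the paper's findings; it only shows how the logits of an image, of a perturbation, and of their sum could each be compared with the PCC.

```python
import numpy as np

def pcc(u, v):
    """Pearson correlation coefficient between two 1-D logit vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u = u - u.mean()  # center both vectors
    v = v - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical stand-in for a DNN: a fixed random linear map to 10 logits.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))

def logits(x):
    """Toy 'network' output; a real analysis would use a trained DNN."""
    return W @ x

image = rng.normal(size=64)               # placeholder for a clean image
perturbation = 3.0 * rng.normal(size=64)  # placeholder for a perturbation

l_img = logits(image)                  # logits of the image alone
l_pert = logits(perturbation)          # logits of the perturbation alone
l_adv = logits(image + perturbation)   # logits of the combined input

# How strongly does each input alone explain the combined logits?
pcc_img = pcc(l_img, l_adv)
pcc_pert = pcc(l_pert, l_adv)
print(f"PCC(image, combined)        = {pcc_img:.3f}")
print(f"PCC(perturbation, combined) = {pcc_pert:.3f}")
```

Under this reading, a PCC close to 1 between the perturbation's logits and the combined logits, together with a low PCC for the image, would correspond to the perturbation carrying the dominant features while the image behaves like noise.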