SAIF: Sparse Adversarial and Imperceptible Attack Framework
This work addresses the vulnerability of image classifiers to adversarial attacks, a critical security issue for AI systems. Its contribution is incremental, since it builds on existing sparse attack methods.
The paper proposes SAIF, a framework that generates sparse, low-magnitude perturbations to deceive image classifiers, and reports that it outperforms state-of-the-art sparse attack methods on ImageNet.
Adversarial attacks hamper the decision-making ability of neural networks by perturbing the input signal. Adding a small, carefully calculated distortion to an image, for instance, can deceive a well-trained image classification network. In this work, we propose a novel attack technique called the Sparse Adversarial and Imperceptible Attack Framework (SAIF). Specifically, we design imperceptible attacks that apply low-magnitude perturbations to a small number of pixels, and we leverage these sparse attacks to reveal the vulnerability of classifiers. We use the Frank-Wolfe (conditional gradient) algorithm to simultaneously optimize the attack perturbations for bounded magnitude and sparsity, with an $O(1/\sqrt{T})$ convergence rate. Empirical results show that SAIF computes highly imperceptible and interpretable adversarial examples and outperforms state-of-the-art sparse attack methods on the ImageNet dataset.
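To make the Frank-Wolfe step concrete, here is a minimal NumPy sketch of how a conditional gradient update can enforce a magnitude bound and a sparsity budget at the same time. It is an illustration under assumptions that the abstract does not state: the perturbation is factored as `s * p`, with a mask `s` constrained to the set {s in [0,1]^n, sum(s) <= k} and magnitudes `p` constrained to an L-infinity ball of radius `eps`; the step size `2 / (t + 2)` is the textbook Frank-Wolfe schedule; and the loss is a toy linear margin between two classes rather than a neural network. All function and variable names (`lmo_linf`, `lmo_k_sparse`, `frank_wolfe_sparse_attack`, `grad_fn`, and the toy data) are hypothetical.

```python
import numpy as np

def lmo_linf(grad, eps):
    """Linear minimization oracle for the L-infinity ball of radius eps."""
    return -eps * np.sign(grad)

def lmo_k_sparse(grad, k):
    """Linear minimization oracle for {s in [0,1]^n : sum(s) <= k}."""
    v = np.zeros_like(grad)
    idx = np.argsort(grad)[:k]   # indices of the k smallest gradient entries
    idx = idx[grad[idx] < 0]     # keep only entries that reduce <grad, s>
    v[idx] = 1.0
    return v

def frank_wolfe_sparse_attack(grad_fn, x, eps, k, steps=20):
    """Assumed factorization: perturbation = s * p (mask times magnitude)."""
    g = grad_fn(x)
    p = lmo_linf(g, eps)             # start from a vertex of each constraint set
    s = lmo_k_sparse(g * p, k)
    for t in range(steps):
        g = grad_fn(x + s * p)       # loss gradient w.r.t. the perturbation
        v_p = lmo_linf(g * s, eps)   # chain rule: dL/dp = g * s
        v_s = lmo_k_sparse(g * p, k) # chain rule: dL/ds = g * p
        eta = 2.0 / (t + 2.0)        # textbook step size (an assumption here)
        p = (1.0 - eta) * p + eta * v_p  # convex combination stays feasible
        s = (1.0 - eta) * s + eta * v_s
    return s, p

# Toy setup (hypothetical): a linear two-class margin stands in for a network.
rng = np.random.default_rng(0)
n, k, eps = 100, 5, 0.1
w_true, w_target = rng.normal(size=n), rng.normal(size=n)
x = rng.normal(size=n)

def margin(z):
    """True-class score minus target-class score; lower favors the attack."""
    return float((w_true - w_target) @ z)

def grad_fn(z):
    """Gradient of the linear margin, which does not depend on z."""
    return w_true - w_target

s, p = frank_wolfe_sparse_attack(grad_fn, x, eps, k)
delta = s * p
print(margin(x), margin(x + delta), np.count_nonzero(delta))
```

Because every iterate is a convex combination of feasible points, no projection step is needed: `p` never leaves the `eps` ball and `s` never exceeds the budget `k`. For an actual image classifier, `grad_fn` would be replaced by the gradient of the attack loss obtained through automatic differentiation, and the objective would no longer be linear, so more iterations would typically be required.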