Unrestricted Adversarial Samples Based on Non-semantic Feature Clusters Substitution
This work addresses a critical weakness in adversarial defense mechanisms for image classification, though the contribution is incremental because it builds on existing adversarial attack concepts.
The paper tackles the problem of generating adversarial examples that bypass defenses relying on $L_p$ norm constraints. It does so by introducing unrestricted perturbations based on spurious feature clusters learned by models. The results show that these adversarial samples, which do not alter image semantics, effectively fool adversarially trained DNN classifiers in both black-box and white-box scenarios.
Most current methods generate adversarial examples under an $L_p$ norm constraint. As a result, many defense methods exploit this property to eliminate the impact of such attacks. In this paper, we instead introduce "unrestricted" perturbations that create adversarial samples by exploiting spurious relations learned during model training. Specifically, we find feature clusters among non-semantic features that are strongly correlated with the model's predictions, and treat them as spurious relations learned by the model. We then create adversarial samples by using them to replace the corresponding feature clusters in the target image. Experimental evaluations show that, in both black-box and white-box settings, our adversarial examples do not change the semantics of images while still being effective at fooling an adversarially trained DNN image classifier.
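The substitution step described above can be pictured with a minimal sketch. It assumes the non-semantic feature cluster has already been located and is available as a boolean pixel mask. The abstract does not say how clusters are found or represented, so the mask, the pixel-space substitution, and the function names `substitute_cluster` and `lp_distance` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def substitute_cluster(target, source, mask):
    """Return a copy of `target` whose masked pixels are taken from `source`.

    target, source: arrays of shape (H, W, C).
    mask: boolean array of shape (H, W) marking the (assumed) non-semantic
          feature cluster to be replaced. Hypothetical representation.
    """
    if target.shape != source.shape:
        raise ValueError("target and source must have the same shape")
    if mask.shape != target.shape[:2]:
        raise ValueError("mask must match the spatial shape of the images")
    out = target.copy()
    out[mask] = source[mask]  # boolean indexing replaces only masked pixels
    return out


def lp_distance(a, b, p=np.inf):
    """L_p distance between two images, computed on the flattened difference."""
    diff = (a.astype(np.float64) - b.astype(np.float64)).ravel()
    return float(np.linalg.norm(diff, ord=p))


# Toy data: an all-zero target image and an all-one source image.
target = np.zeros((8, 8, 3))
source = np.ones((8, 8, 3))

# Assumed cluster location: a 2x2 corner region away from the image content.
mask = np.zeros((8, 8), dtype=bool)
mask[:2, :2] = True

adversarial = substitute_cluster(target, source, mask)
linf = lp_distance(adversarial, target, p=np.inf)
l2 = lp_distance(adversarial, target, p=2)
```

In this toy case the replaced pixels change by the full intensity range, so the perturbation would not fit inside a small $L_\infty$ budget even though it touches only four pixels. That is the sense in which such a substitution is "unrestricted" relative to $L_p$-bounded attacks.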