LG · CR · Jan 31, 2022

Can Adversarial Training Be Manipulated By Non-Robust Features?

arXiv:2201.13329v4 · 17 citations · Has Code
AI Analysis

This work addresses a critical security problem for machine learning practitioners by exposing a novel vulnerability in adversarial training. The contribution is incremental in that it builds on existing defense methods.

The paper studies a new threat model for adversarial training called the stability attack, which slightly manipulates the training data to undermine test-time robustness. It shows that adversarial training with a conventional defense budget fails under these attacks, so an adaptive defense with an enlarged budget is needed.

Adversarial training, originally designed to resist test-time adversarial examples, has been shown to be promising in mitigating training-time availability attacks. This defense ability, however, is challenged in this paper. We identify a novel threat model named stability attack, which aims to hinder robust availability by slightly manipulating the training data. Under this threat, we show that adversarial training using a conventional defense budget $ε$ provably fails to provide test robustness in a simple statistical setting, where the non-robust features of the training data can be reinforced by $ε$-bounded perturbation. Further, we analyze the necessity of enlarging the defense budget to counter stability attacks. Finally, comprehensive experiments demonstrate that stability attacks are harmful on benchmark datasets, and thus the adaptive defense is necessary to maintain robustness. Our code is available at https://github.com/TLMichael/Hypocritical-Perturbation.
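To make the idea of an $ε$-bounded perturbation that reinforces features concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a linear logistic model with labels in {-1, +1} and an L-infinity budget, and the function names (`logistic_loss`, `loss_grad_x`, `perturb`) are hypothetical. With `hypocritical=True` the perturbation lowers the loss on each training point, which is one plausible reading of "reinforcing" the features a model relies on; with `hypocritical=False` it is an ordinary adversarial perturbation that raises the loss.

```python
import numpy as np

def logistic_loss(w, x, y):
    # Per-example logistic loss for a linear model, labels in {-1, +1}.
    return np.logaddexp(0.0, -y * (x @ w))

def loss_grad_x(w, x, y):
    # Gradient of the logistic loss with respect to the inputs x.
    margin = y * (x @ w)
    coeff = -y / (1.0 + np.exp(margin))  # d(loss) / d(x @ w)
    return coeff[:, None] * w[None, :]

def perturb(w, x, y, eps, steps=10, hypocritical=True):
    # Projected sign-gradient steps inside the L-infinity ball of radius eps.
    # hypocritical=True descends the loss; False ascends it (adversarial).
    direction = -1.0 if hypocritical else 1.0
    step = 2.5 * eps / steps  # assumed step size, chosen so the ball boundary is reachable
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = loss_grad_x(w, x + delta, y)
        delta = np.clip(delta + direction * step * np.sign(g), -eps, eps)
    return x + delta
```

Calling `perturb(w, x, y, eps)` on a batch `x` of shape `(n, d)` returns inputs that stay within `eps` of the originals in every coordinate. Under these assumptions, the per-example loss of the hypocritical version is no larger than the clean loss, and the adversarial version is no smaller.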

Code Implementations: 1 repo
