Nash Equilibria and Pitfalls of Adversarial Training in Adversarial Robustness Games
This work addresses convergence issues in adversarial training that are relevant to machine learning practitioners, but the contribution is incremental because it builds on existing game-theoretic frameworks.
The paper studies whether adversarial training converges in adversarial robustness games. It proves that the alternating best-response strategy may fail to converge even for a linear classifier, whereas a unique pure Nash equilibrium exists and is provably robust. Experiments support both results.
Adversarial training is a standard technique for training adversarially robust models. In this paper, we study adversarial training as an alternating best-response strategy in a 2-player zero-sum game. We prove that even in a simple scenario of a linear classifier and a statistical model that abstracts robust vs. non-robust features, the alternating best-response strategy of such a game may not converge. On the other hand, a unique pure Nash equilibrium of the game exists and is provably robust. We support our theoretical results with experiments, showing the non-convergence of adversarial training and the robustness of the Nash equilibrium.
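To make the non-convergence claim concrete, here is a minimal toy sketch. It is not the paper's statistical model. It assumes a single non-robust feature with class-conditional mean mu, a classifier weight w in [-1, 1], an adversary that shifts the mean by d in [-eps, eps], and a bilinear payoff w * (mu + d) that the classifier maximizes and the adversary minimizes. All function and parameter names are hypothetical. When eps > mu, alternating best responses flip the sign of w on every round, while the point (w = 0, d = -mu) is a pure Nash equilibrium of this toy game.

```python
def classifier_best_response(d, mu):
    """Weight in [-1, 1] maximizing w * (mu + d) for a fixed shift d."""
    s = mu + d
    if s > 0:
        return 1.0
    if s < 0:
        return -1.0
    return 0.0  # indifferent; pick zero weight


def adversary_best_response(w, eps):
    """Shift in [-eps, eps] minimizing w * (mu + d) for a fixed weight w."""
    if w > 0:
        return -eps
    if w < 0:
        return eps
    return 0.0  # indifferent; pick no shift


def alternate(w0, mu, eps, rounds):
    """Run alternating best responses and record the classifier weight."""
    w = w0
    history = [w]
    for _ in range(rounds):
        d = adversary_best_response(w, eps)
        w = classifier_best_response(d, mu)
        history.append(w)
    return history


def is_nash(w, d, mu, eps, tol=1e-12):
    """Check that neither player can improve by deviating unilaterally."""
    payoff = w * (mu + d)
    best_for_classifier = abs(mu + d)          # max over w in [-1, 1]
    best_for_adversary = w * mu - eps * abs(w)  # min over d in [-eps, eps]
    return (best_for_classifier - payoff <= tol
            and payoff - best_for_adversary <= tol)


if __name__ == "__main__":
    mu, eps = 0.5, 1.0  # assumed values with eps > mu
    print(alternate(1.0, mu, eps, rounds=6))
    print(is_nash(0.0, -mu, mu, eps))
    print(is_nash(1.0, -eps, mu, eps))
```

With mu = 0.5 and eps = 1.0, the recorded weights alternate between 1.0 and -1.0 indefinitely, so the iterates never settle. The equilibrium check passes only at zero weight on the non-robust feature, which mirrors, in this simplified setting, the abstract's contrast between the oscillating dynamics and the robust Nash equilibrium.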