Purifying Adversarial Perturbation with Adversarially Trained Auto-encoders
This work addresses the high cost of adversarial training for machine learning practitioners by offering a more efficient alternative, though the contribution is incremental because it builds on existing adversarial training techniques.
The paper tackles the high cost of adversarial training for protecting machine learning models by training an external auto-encoder with iterative adversarial training; once trained, the auto-encoder can be used to protect other models directly. The results show that this method outperforms other purifying-based methods against white-box attacks and transfers well to models with different architectures.
Machine learning models are vulnerable to adversarial examples. Iterative adversarial training has shown promising results against strong white-box attacks. However, adversarial training is very expensive, and this expensive training scheme must be repeated every time a model needs to be protected. In this paper, we propose applying an iterative adversarial training scheme to an external auto-encoder, which, once trained, can be used to protect other models directly. We empirically show that our model outperforms other purifying-based methods against white-box attacks and transfers well to directly protect other base models with different architectures.
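As a rough illustration of the idea, the sketch below adversarially trains a tiny auto-encoder placed in front of a frozen classifier. The abstract does not specify the loss, the attack, or the architectures, so everything here is an assumption made for illustration: a linear encoder/decoder (`E`, `D`), a frozen linear logistic classifier (`w`, `b`) with made-up weights, a cross-entropy loss taken through the classifier, and an L-infinity PGD attack with hypothetical `eps` and `step` values. It is not the authors' implementation.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matTvec(M, v):
    # Computes M^T v without building the transpose.
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def sigmoid(t):
    t = max(-60.0, min(60.0, t))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))

def forward(E, D, w, b, x):
    z = matvec(E, x)  # encode
    r = matvec(D, z)  # decode: the "purified" input
    p = sigmoid(sum(wi * ri for wi, ri in zip(w, r)) + b)
    return z, r, p

def loss(p, y):
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grads(E, D, w, b, x, y):
    # Analytic gradients of the cross-entropy through classifier and auto-encoder.
    z, r, p = forward(E, D, w, b, x)
    g_r = [(p - y) * wi for wi in w]              # dL/dr
    g_D = [[gr * zj for zj in z] for gr in g_r]   # dL/dD
    g_z = matTvec(D, g_r)                         # dL/dz
    g_E = [[gz * xj for xj in x] for gz in g_z]   # dL/dE
    g_x = matTvec(E, g_z)                         # dL/dx (used by the attack)
    return g_E, g_D, g_x

def sign(t):
    return (t > 0) - (t < 0)

def pgd(E, D, w, b, x, y, eps=0.1, step=0.025, iters=10):
    # White-box L-infinity PGD against the full pipeline classifier(auto-encoder(x)).
    x_adv = list(x)
    for _ in range(iters):
        _, _, g_x = grads(E, D, w, b, x_adv, y)
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, g_x)]
        # Project back into the eps-ball around the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def train_step(E, D, w, b, x, y, lr=0.05):
    # One iterative-adversarial-training step: attack, then update ONLY the
    # auto-encoder. The classifier (w, b) stays frozen (an assumption here).
    x_adv = pgd(E, D, w, b, x, y)
    g_E, g_D, _ = grads(E, D, w, b, x_adv, y)
    for i in range(len(E)):
        for j in range(len(E[0])):
            E[i][j] -= lr * g_E[i][j]
    for i in range(len(D)):
        for j in range(len(D[0])):
            D[i][j] -= lr * g_D[i][j]
    return x_adv

# Hypothetical toy setup: 4-dimensional inputs, 2-dimensional bottleneck.
random.seed(0)
d, k = 4, 2
w, b = [1.0, -1.0, 0.5, -0.5], 0.0
E = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(k)]
D = [[random.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(d)]
x, y = [0.2, -0.1, 0.4, 0.3], 1

for _ in range(20):
    train_step(E, D, w, b, x, y)
```

In this sketch the expensive inner attack loop runs only while fitting `E` and `D`. Protecting a different base model would then amount to placing the same trained auto-encoder in front of it, which is the transfer setting the abstract describes. How well that works in practice is an empirical question this toy example does not address.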