Certified Robustness to Clean-Label Poisoning Using Diffusion Denoising
This work addresses security vulnerabilities in machine learning training data, which matters for applications that require robust models. The contribution is incremental, however, because it builds on existing techniques such as randomized smoothing.
The paper tackles clean-label poisoning attacks by proposing a certified defense based on diffusion denoising, which reduces attack success rates to 0-16% with only a negligible drop in test accuracy.
We present a certified defense against clean-label poisoning attacks under the $\ell_2$-norm. These attacks inject a small number of poisoning samples (e.g., 1% of the training data) that contain bounded adversarial perturbations, with the goal of inducing a targeted misclassification of a test-time input. Inspired by the adversarial robustness achieved by randomized smoothing, we show how an off-the-shelf diffusion denoising model can sanitize the tampered training data. We extensively test our defense against seven clean-label poisoning attacks under both $\ell_2$- and $\ell_{\infty}$-norm bounds and reduce their attack success rate to 0-16% with only a negligible drop in test accuracy. We compare our defense with existing countermeasures against clean-label poisoning and show that it reduces the attack success rate the most while offering the best model utility. Our results highlight the need for future work on developing stronger clean-label attacks, and on using our certified yet practical defense as a strong baseline for evaluating them.
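To make the sanitization step concrete, the following is a minimal sketch of the noise-then-denoise idea, not the paper's implementation. It assumes images stored as a NumPy array with values in [0, 1]. The names `sanitize_training_set` and `placeholder_denoiser`, the noise level `sigma`, and the perturbation budget of 0.5 are all hypothetical and chosen for illustration. In particular, `placeholder_denoiser` is a simple shrinkage stand-in for the off-the-shelf diffusion denoiser, which would be called in its place.

```python
import numpy as np


def placeholder_denoiser(x_noisy, sigma):
    """Hypothetical stand-in for a diffusion denoiser (assumption).

    Shrinks each image toward its own mean with a gain that decreases as
    sigma grows. A real pipeline would call a pretrained diffusion model here.
    """
    mean = x_noisy.mean(axis=(1, 2, 3), keepdims=True)
    gain = 1.0 / (1.0 + sigma ** 2)
    return mean + gain * (x_noisy - mean)


def sanitize_training_set(x_train, sigma, denoiser, seed=0):
    """Add isotropic Gaussian noise to every training image, then denoise.

    The Gaussian noise is the randomized-smoothing ingredient: it is meant to
    drown out l2-bounded poisoning perturbations before denoising.
    """
    rng = np.random.default_rng(seed)
    x_noisy = x_train + rng.normal(0.0, sigma, size=x_train.shape)
    x_denoised = denoiser(x_noisy, sigma)
    # Keep pixel values in the valid image range.
    return np.clip(x_denoised, 0.0, 1.0)


# Toy data: 8 random "images" in NCHW layout (illustrative only).
data_rng = np.random.default_rng(1)
x_train = data_rng.uniform(0.0, 1.0, size=(8, 3, 32, 32))

# Simulate one poisoned sample with an l2-bounded perturbation (norm 0.5).
delta = data_rng.normal(size=(3, 32, 32))
delta *= 0.5 / np.linalg.norm(delta)
x_train[0] = np.clip(x_train[0] + delta, 0.0, 1.0)

# Sanitize the whole training set before any model is trained on it.
x_sanitized = sanitize_training_set(x_train, sigma=0.25, denoiser=placeholder_denoiser)
```

In this sketch the entire training set is sanitized, since the defender does not know which samples are poisoned. A model would then be trained on `x_sanitized` instead of `x_train`.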