LG | CV | Apr 16, 2024

Do Counterfactual Examples Complicate Adversarial Training?

arXiv:2404.10588v21 citations | h-index: 4
Originality: Incremental advance
AI Analysis

The findings challenge the assumption that non-robust features are not interpretable, with implications for adversarial training in machine learning.

The study investigates the robustness-performance tradeoff in classifiers by using diffusion models to generate low-norm counterfactual examples. It finds that robust models' accuracy on clean data correlates with the data's proximity to these examples, and that robust models perform poorly on the examples themselves, indicating overlap between non-robust and semantic features.

We leverage diffusion models to study the robustness-performance tradeoff of robust classifiers. Our approach introduces a simple, pretrained diffusion method to generate low-norm counterfactual examples (CEs): semantically altered data which results in different true class membership. We report that the confidence and accuracy of robust models on their clean training data are associated with the proximity of the data to their CEs. Moreover, robust models perform very poorly when evaluated on the CEs directly, as they become increasingly invariant to the low-norm, semantic changes brought by CEs. The results indicate a significant overlap between non-robust and semantic features, countering the common assumption that non-robust features are not interpretable.
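The proximity analysis described in the abstract can be illustrated with a small sketch. This is not the authors' code: the inputs, the "counterfactuals", and the confidence values below are all synthetic stand-ins, and the function names `l2_distance` and `pearson` are hypothetical helpers. In the paper, the CEs come from a pretrained diffusion model and the confidences from a robust classifier. The sketch only shows how one might measure each example's L2 distance to its CE and check whether that distance is associated with model confidence.

```python
import math
import random


def l2_distance(x, x_ce):
    """L2 norm of the change between an input and its counterfactual."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_ce)))


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (std_u * std_v)


random.seed(0)
dim, n = 16, 200

# Synthetic inputs (assumption: stand-ins for flattened training images).
inputs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Synthetic "counterfactuals": each input perturbed by a different amount.
# Assumption: a real pipeline would obtain these from a diffusion model.
scales = [random.uniform(0.1, 2.0) for _ in range(n)]
counterfactuals = [
    [a + s * random.gauss(0, 1) for a in x] for x, s in zip(inputs, scales)
]

distances = [l2_distance(x, c) for x, c in zip(inputs, counterfactuals)]

# Toy confidences that rise with distance to the CE, plus noise.
# Assumption: in practice these would be a robust model's softmax outputs;
# the relationship here holds by construction, not as a reproduced result.
mean_d = sum(distances) / n
confidences = [
    1 / (1 + math.exp(-(d - mean_d))) + random.gauss(0, 0.05) for d in distances
]

r = pearson(distances, confidences)
print(f"Pearson r between CE distance and confidence: {r:.2f}")
```

With real data, the same two helpers could be applied to a classifier's confidences and the norms of diffusion-generated CEs; a rank correlation may be a better choice if the relationship is monotone but not linear.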

