Explainability-Guided Defense: Attribution-Aware Model Refinement Against Adversarial Data Attacks
This work addresses the need for transparent and robust defenses in safety-critical domains such as healthcare and autonomous navigation. It offers an incremental improvement by integrating interpretability into adversarial training.
The paper tackles adversarial vulnerability in deep learning by linking interpretability and robustness: it uses LIME to identify spurious features and suppresses them during training, reporting improved adversarial robustness and out-of-distribution generalization on CIFAR-10, CIFAR-10-C, and CIFAR-100.
The growing reliance on deep learning models in safety-critical domains such as healthcare and autonomous navigation underscores the need for defenses that are both robust to adversarial perturbations and transparent in their decision-making. In this paper, we identify a connection between interpretability and robustness that can be directly leveraged during training. Specifically, we observe that spurious, unstable, or semantically irrelevant features identified through Local Interpretable Model-Agnostic Explanations (LIME) contribute disproportionately to adversarial vulnerability. Building on this insight, we introduce an attribution-guided refinement framework that transforms LIME from a passive diagnostic into an active training signal. Our method systematically suppresses spurious features using feature masking, sensitivity-aware regularization, and adversarial augmentation in a closed-loop refinement pipeline. This approach does not require additional datasets or model architectures and integrates seamlessly into standard adversarial training. Theoretically, we derive an attribution-aware lower bound on adversarial distortion that formalizes the link between explanation alignment and robustness. Empirical evaluations on CIFAR-10, CIFAR-10-C, and CIFAR-100 demonstrate substantial improvements in adversarial robustness and out-of-distribution generalization.
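To make the attribute, flag, and mask steps of the closed-loop pipeline concrete, the following is a minimal toy sketch, not the authors' implementation. It rests on several assumptions that the abstract does not state: a simplified LIME-style perturbation estimate (mean output with a feature kept minus mean output with it masked) stands in for the actual LIME procedure, a hypothetical set of semantically relevant indices (`relevant`) is taken as given, and a feature is called spurious when its attribution magnitude exceeds a hypothetical `threshold` while lying outside that set. The names `lime_style_attributions`, `flag_spurious`, `mask_features`, and `toy_model` are illustrative only. Sensitivity-aware regularization and adversarial augmentation are not shown.

```python
import random


def lime_style_attributions(predict, x, n_samples=2000, seed=0):
    """Simplified LIME-style estimate (an assumption, not the lime library):
    randomly mask features (set to 0) and score each feature by the mean
    output when it is kept minus the mean output when it is masked."""
    rng = random.Random(seed)
    d = len(x)
    kept_sum, kept_n = [0.0] * d, [0] * d
    drop_sum, drop_n = [0.0] * d, [0] * d
    for _ in range(n_samples):
        keep = [rng.random() < 0.5 for _ in range(d)]
        y = predict([xi if k else 0.0 for xi, k in zip(x, keep)])
        for j, k in enumerate(keep):
            if k:
                kept_sum[j] += y
                kept_n[j] += 1
            else:
                drop_sum[j] += y
                drop_n[j] += 1
    return [
        kept_sum[j] / max(kept_n[j], 1) - drop_sum[j] / max(drop_n[j], 1)
        for j in range(d)
    ]


def flag_spurious(attributions, relevant, threshold=0.5):
    """Assumed rule: a feature is spurious if it carries large attribution
    but lies outside the hypothetical set of semantically relevant indices."""
    return [
        j for j, a in enumerate(attributions)
        if abs(a) > threshold and j not in relevant
    ]


def mask_features(x, spurious):
    """Feature masking: zero out flagged features before the next training pass."""
    return [0.0 if j in spurious else xi for j, xi in enumerate(x)]


# Hypothetical toy linear scorer: feature 2 acts as a strong shortcut feature.
WEIGHTS = [2.0, 1.5, 3.0, 0.0]


def toy_model(z):
    return sum(w * zi for w, zi in zip(WEIGHTS, z))


x = [1.0, 1.0, 1.0, 1.0]
attrs = lime_style_attributions(toy_model, x)
spurious = flag_spurious(attrs, relevant={0, 1})
refined = mask_features(x, spurious)
print(spurious, refined)
```

In a real pipeline the masked inputs would feed the next round of adversarial training and attributions would be recomputed afterwards, which is what makes the loop closed. How the relevant region is obtained in practice is left open here.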