Improving Adversarial Robustness of Attribution via Implicit Regularization
For practitioners needing robust explainability in deep learning, this work provides a computationally cheap alternative to explicit regularization, while revealing a fundamental limitation of attention-based attributions.
The paper shows that adversarial robustness of attributions can be achieved implicitly through standard SGD training, without explicit regularization, and demonstrates this across architectures and datasets with negligible overhead. It also shows that softmax normalization in attention mechanisms limits this robustness for attention-based attribution, and that replacing softmax attention with kernel-based attention restores it.
The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.
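The contrast between softmax attention and kernel-based attention can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not state which kernel is used, so the feature map phi(x) = elu(x) + 1 (a common choice in linear attention) is an assumption, and all function names below are hypothetical. The sketch computes the attention weights of a single query over a small set of keys under both schemes, together with the entropy of each weight distribution.

```python
import math


def softmax_attention_weights(query, keys):
    """Softmax-normalized attention weights of one query over a list of keys."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elu_plus_one(x):
    """Positive feature map phi(x) = elu(x) + 1 (assumed kernel, not from the source)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def kernel_attention_weights(query, keys, feature_map=elu_plus_one):
    """Kernel attention weights: phi(q) . phi(k), normalized by their sum."""
    phi_q = feature_map(query)
    sims = [sum(a * b for a, b in zip(phi_q, feature_map(key))) for key in keys]
    total = sum(sims)
    return [s / total for s in sims]


def entropy(weights):
    """Shannon entropy (in nats) of a weight distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Toy example: one 2-dimensional query attending over three keys.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

softmax_w = softmax_attention_weights(query, keys)
kernel_w = kernel_attention_weights(query, keys)

print("softmax weights:", softmax_w, "entropy:", entropy(softmax_w))
print("kernel weights: ", kernel_w, "entropy:", entropy(kernel_w))
```

Both functions return positive weights that sum to one. They differ in how similarity scores are turned into weights: softmax exponentiates scaled dot products, while the kernel variant uses an inner product of feature-mapped vectors.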