Pixel-level Certified Explanations via Randomized Smoothing
This work addresses the vulnerability of attribution methods to adversarial perturbations, which undermines trust in AI explanations. The contribution is incremental in that it builds on existing randomized smoothing techniques.
The paper tackles the problem of non-robust pixel-level explanations in deep learning by introducing a certification framework that guarantees robustness for any black-box attribution method using randomized smoothing, achieving robust, interpretable, and faithful attributions across 12 methods and 5 ImageNet models.
Post-hoc attribution methods aim to explain deep learning predictions by highlighting influential input pixels. However, these explanations are highly non-robust: small, imperceptible input perturbations can drastically alter the attribution map while maintaining the same prediction. This vulnerability undermines their trustworthiness and calls for rigorous robustness guarantees of pixel-level attribution scores. We introduce the first certification framework that guarantees pixel-level robustness for any black-box attribution method using randomized smoothing. By sparsifying and smoothing attribution maps, we reformulate the task as a segmentation problem and certify each pixel's importance against $\ell_2$-bounded perturbations. We further propose three evaluation metrics to assess certified robustness, localization, and faithfulness. An extensive evaluation of 12 attribution methods across 5 ImageNet models shows that our certified attributions are robust, interpretable, and faithful, enabling reliable use in downstream tasks. Our code is at https://github.com/AlaaAnani/certified-attributions.
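To make the sparsify, smooth, and certify steps concrete, the following is a minimal sketch of how such a pipeline could look. It is not the authors' implementation. The names `sparsify_top_k`, `certify_attribution`, and `attribute_fn`, and all default parameter values, are hypothetical. The sketch assumes that the attribution map has the same shape as the input, that sparsification keeps the top fraction of pixels as a binary map, and that confidence bounds come from a Hoeffding inequality with a Bonferroni correction (a simple stand-in for the tighter tests usually used in randomized smoothing). The radius formula $R = \sigma \, \Phi^{-1}(\underline{p})$ is the standard randomized smoothing bound for the two-class case.

```python
import math
from statistics import NormalDist

import numpy as np


def sparsify_top_k(attribution, k_fraction):
    """Binarize an attribution map: 1 for the top k_fraction of pixels, else 0.

    Ties at the threshold value are all kept, so slightly more than
    k pixels can be marked as important.
    """
    flat = attribution.ravel()
    k = max(1, int(round(k_fraction * flat.size)))
    # Value of the k-th largest entry, used as the cut-off.
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return (attribution >= threshold).astype(np.int64)


def certify_attribution(attribute_fn, x, sigma=0.25, n_samples=100,
                        k_fraction=0.2, alpha=0.001, seed=0):
    """Per-pixel certification of a sparsified attribution map (sketch).

    attribute_fn: hypothetical black-box callable mapping an input array to
                  an attribution array of the same shape (an assumption).
    Returns (labels, radius): labels are 1 (important), 0 (not important),
    or -1 (abstain); radius is the certified l2 radius per pixel.
    """
    rng = np.random.default_rng(seed)
    votes = np.zeros(x.shape, dtype=np.int64)

    # Smooth: attribute many Gaussian-perturbed copies of the input and
    # count how often each pixel lands in the top-k set.
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        votes += sparsify_top_k(attribute_fn(noisy), k_fraction)

    # Per-pixel majority vote over the two classes {0, 1}.
    top_class = (2 * votes > n_samples).astype(np.int64)
    p_hat = np.maximum(votes, n_samples - votes) / n_samples

    # One-sided Hoeffding lower bound on the majority probability.
    # Bonferroni correction over all pixels and both classes.
    alpha_pixel = alpha / (2 * x.size)
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha_pixel) / (2.0 * n_samples))

    # Certify only pixels whose majority class is confidently above 1/2.
    certified = p_lower > 0.5
    labels = np.where(certified, top_class, -1)

    # Two-class randomized smoothing radius: sigma * Phi^{-1}(p_lower).
    inv_cdf = NormalDist().inv_cdf
    radius = np.zeros(x.shape, dtype=float)
    radius[certified] = [sigma * inv_cdf(p) for p in p_lower[certified]]
    return labels, radius
```

Treating each pixel as a two-class decision (important or not) is what lets a segmentation-style certificate apply: a pixel's label is guaranteed stable under any perturbation whose $\ell_2$ norm is below its radius, and pixels without a confident majority are left uncertified rather than guessed.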