CV · Mar 7

Training for Trustworthy Saliency Maps: Adversarial Training Meets Feature-Map Smoothing

Dipkamal Bhusal, Md Tanvirul Alam, Nidhi Rastogi

arXiv:2603.07302v1 · 5.8 · h-index: 10

Predicted impact: top 87% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the problem of noisy and unstable saliency maps for users who rely on these explanations in high-stakes settings, offering an incremental improvement to existing adversarial training methods.

This paper investigates how training procedures affect the quality of gradient-based saliency maps, which are often noisy and unstable. The authors find that adversarial training improves sparsity and input-side stability but can degrade output-side stability. Their proposed method, which combines adversarial training with a feature-map smoothing block, preserves sparsity while improving both input-side and output-side stability across FMNIST, CIFAR-10, and ImageNette, and a human study finds the resulting maps are perceived as more sufficient and trustworthy.

Gradient-based saliency methods such as Vanilla Gradient (VG) and Integrated Gradients (IG) are widely used to explain image classifiers, yet the resulting maps are often noisy and unstable, limiting their usefulness in high-stakes settings. Most prior work improves explanations by modifying the attribution algorithm, leaving open how the training procedure shapes explanation quality. We take a training-centered view and first provide a curvature-based analysis linking attribution stability to how smoothly the input-gradient field varies locally. Guided by this connection, we study adversarial training and identify a consistent trade-off: it yields sparser and more input-stable saliency maps, but can degrade output-side stability, causing explanations to change even when predictions remain unchanged and logits vary only slightly. To mitigate this, we propose augmenting adversarial training with a lightweight feature-map smoothing block that applies a differentiable Gaussian filter in an intermediate layer. Across FMNIST, CIFAR-10, and ImageNette, our method preserves the sparsity benefits of adversarial training while improving both input-side stability and output-side stability. A human study with 65 participants further shows that smoothed adversarial saliency maps are perceived as more sufficient and trustworthy. Overall, our results demonstrate that explanation quality is critically shaped by training, and that simple smoothing with robust training provides a practical path toward saliency maps that are both sparse and stable.
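The feature-map smoothing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the kernel size, the standard deviation, the padding scheme, or which layer the block sits in, so `size=3`, `sigma=1.0`, edge-replicated padding, and the function names below are all assumptions made for illustration. This is not the authors' implementation. The sketch blurs a single 2D feature-map channel with a separable Gaussian in plain Python. In a deep learning framework the same operation would typically be a fixed-weight depthwise convolution, which is linear and therefore lets gradients flow through it during training.

```python
import math


def gaussian_kernel_1d(size=3, sigma=1.0):
    """Return normalized 1D Gaussian weights centred on the middle tap.

    `size` and `sigma` are illustrative defaults, not values from the paper.
    """
    half = size // 2
    weights = [math.exp(-((i - half) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_feature_map(fmap, size=3, sigma=1.0):
    """Blur one 2D channel (a list of rows) with a separable Gaussian.

    Borders use edge replication (an assumption): out-of-range indices
    are clamped to the nearest valid row or column.
    """
    kernel = gaussian_kernel_1d(size, sigma)
    half = size // 2
    height, width = len(fmap), len(fmap[0])

    def clamp(index, upper):
        return min(max(index, 0), upper - 1)

    # Horizontal pass: filter each row along the width axis.
    rows = [
        [
            sum(kernel[j] * fmap[y][clamp(x + j - half, width)] for j in range(size))
            for x in range(width)
        ]
        for y in range(height)
    ]
    # Vertical pass: filter the intermediate result along the height axis.
    return [
        [
            sum(kernel[j] * rows[clamp(y + j - half, height)][x] for j in range(size))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Usage: a single sharp activation is spread over its 3x3 neighbourhood.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
smoothed = smooth_feature_map(impulse)
```

Because the kernel is normalized, the blur redistributes activation mass without changing its total for interior activations. It damps isolated spikes in the feature map while leaving constant regions unchanged.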
