Investigating Saturation Effects in Integrated Gradients
This work addresses a practical interpretability concern for machine learning practitioners, but the contribution is incremental: it modifies an existing method rather than introducing a new paradigm.
The authors tackled the problem of understanding saturation effects in Integrated Gradients, a popular interpretability method, and found that gradients in saturated regions disproportionately affect attributions. They proposed a variant that focuses on unsaturated regions, showing higher model faithfulness and lower noise sensitivity on ImageNet classification networks.
Integrated Gradients has become a popular method for post-hoc model interpretability. Despite its popularity, the composition and relative impact of different regions of the integral path are not well understood. We explore these effects and find that gradients in saturated regions of this path, where model output changes minimally, contribute disproportionately to the computed attribution. We propose a variant of Integrated Gradients which primarily captures gradients in unsaturated regions and evaluate this method on ImageNet classification networks. We find that this attribution technique shows higher model faithfulness and lower sensitivity to noise compared with standard Integrated Gradients. A notebook illustrating our computations and results is available at https://github.com/vivekmig/captum-1/tree/ExpandedIG.
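To make the idea of splitting the integral path concrete, the following is a minimal sketch, not the authors' implementation. It computes standard Integrated Gradients with a midpoint Riemann sum on a toy saturating model, then sums only the steps taken before the output has reached 90% of its total change. The toy model (`tanh` of a linear score), the weights, the input, the zero baseline, the 90% threshold, and all function names are assumptions chosen for illustration; the abstract does not specify how the proposed variant selects unsaturated regions.

```python
import numpy as np

def model(x, w):
    # Toy saturating model (hypothetical): tanh of a linear score.
    return np.tanh(x @ w)

def model_grad(x, w):
    # Analytic gradient of tanh(x . w) with respect to x.
    return (1.0 - np.tanh(x @ w) ** 2) * w

def ig_steps(x, baseline, w, n_steps=200):
    # Midpoint Riemann sum along the straight path from baseline to x.
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    points = baseline + alphas[:, None] * (x - baseline)
    grads = np.stack([model_grad(p, w) for p in points])
    # Per-step, per-feature contribution to the attribution.
    contribs = grads * (x - baseline) / n_steps
    return alphas, points, contribs

# Illustrative values only (assumptions, not taken from the paper).
w = np.array([2.0, -1.0, 0.5])
x = np.array([1.5, -1.0, 2.0])
baseline = np.zeros_like(x)

alphas, points, contribs = ig_steps(x, baseline, w)
full_ig = contribs.sum(axis=0)  # standard Integrated Gradients

# Mark a step as unsaturated while the output has covered less than
# 90% of its total change (the threshold is an assumed choice).
total_change = model(x, w) - model(baseline, w)
outputs = np.array([model(p, w) for p in points])
unsaturated = (outputs - model(baseline, w)) < 0.9 * total_change

# Hypothetical truncated variant: keep only the unsaturated steps.
truncated_ig = contribs[unsaturated].sum(axis=0)

print("fraction of path that is unsaturated:", unsaturated.mean())
print("full IG:", full_ig, "sum:", full_ig.sum())
print("truncated IG:", truncated_ig, "sum:", truncated_ig.sum())
```

In this sketch the sum of `full_ig` approximates `model(x, w) - model(baseline, w)`, which is the completeness property of Integrated Gradients. In this smooth toy model the gradients shrink once the output saturates, so it illustrates only the bookkeeping of splitting the path, not the disproportionate contribution that the abstract reports for ImageNet networks.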