Rethinking Stability for Attribution-based Explanations
This work addresses the critical need for stable explanations, which underpin model trustworthiness in high-stakes applications, though the contribution is incremental because it builds on prior work on explanation instability.
The paper tackles the problem of unstable attribution-based explanations in machine learning models, introducing new Relative Stability metrics that quantify explanation instability and showing that several popular methods are unstable on three real-world datasets.
As attribution-based explanation methods are increasingly used to establish model trustworthiness in high-stakes situations, it is critical to ensure that these explanations are stable, e.g., robust to infinitesimal perturbations to an input. However, previous works have shown that state-of-the-art explanation methods generate unstable explanations. Here, we introduce metrics to quantify the stability of an explanation and show that several popular explanation methods are unstable. In particular, we propose new Relative Stability metrics that measure the change in output explanation with respect to change in input, model representation, or output of the underlying predictor. Finally, our experimental evaluation with three real-world datasets demonstrates interesting insights for seven explanation methods and different stability metrics.
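The abstract describes the Relative Stability metrics only in words, so the following is a minimal sketch of one plausible reading for the input case, not the paper's exact definition: the worst observed ratio between the relative change in the explanation and the relative change in the input, over small random perturbations. The function names (`relative_input_stability`, `linear_attribution`, `cubic_attribution`), the Gaussian perturbation scheme, the L2 norm, and the `eps` guard against division by zero are all assumptions introduced for illustration.

```python
import math
import random


def l2_norm(values):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(v * v for v in values))


def relative_change(original, perturbed, eps=1e-8):
    """L2 norm of the element-wise relative difference (original - perturbed) / original.

    Entries of `original` with magnitude below `eps` are replaced by `eps`
    in the denominator to avoid division by zero (an assumed safeguard).
    """
    return l2_norm([
        (o - p) / (o if abs(o) > eps else eps)
        for o, p in zip(original, perturbed)
    ])


def relative_input_stability(explain, x, n_samples=50, noise=1e-2, eps=1e-8, seed=0):
    """Largest ratio of relative explanation change to relative input change.

    `explain` is any callable mapping an input vector to an attribution vector.
    Larger return values indicate a less stable explanation around `x`.
    """
    rng = random.Random(seed)
    e_x = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        # Assumed perturbation scheme: small independent Gaussian noise per feature.
        x_pert = [xi + rng.gauss(0.0, noise) for xi in x]
        numerator = relative_change(e_x, explain(x_pert), eps)
        denominator = max(relative_change(x, x_pert, eps), eps)
        worst = max(worst, numerator / denominator)
    return worst


# Toy attribution functions (hypothetical, for illustration only).
WEIGHTS = [0.5, -1.0, 2.0]


def linear_attribution(x):
    """Gradient-times-input attribution for the linear model f(x) = w . x."""
    return [w * xi for w, xi in zip(WEIGHTS, x)]


def cubic_attribution(x):
    """An attribution that reacts about three times as strongly to input changes."""
    return [xi ** 3 for xi in x]


x0 = [1.0, 2.0, -3.0]
stable_score = relative_input_stability(linear_attribution, x0)
sensitive_score = relative_input_stability(cubic_attribution, x0)
```

Under this reading, an explanation that changes in proportion to the input scores close to 1, while one that amplifies input changes scores higher. The analogous metrics with respect to the model representation or the predictor output would replace the denominator with the relative change in that quantity.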