Probabilistic Stability Guarantees for Feature Attributions
This work addresses the need for reliable and efficient stability certification in explanation methods for machine learning models, offering a practical solution for researchers and practitioners in interpretable AI.
The paper tackles the problem of providing stability guarantees for feature attributions in machine learning. It introduces a model-agnostic certification algorithm that yields non-trivial and interpretable guarantees, and it shows that mild smoothing achieves a more favorable trade-off between accuracy and stability than the heavy smoothing used in prior methods.
Stability guarantees have emerged as a principled way to evaluate feature attributions, but existing certification methods rely on heavily smoothed classifiers and often produce conservative guarantees. To address these limitations, we introduce soft stability and propose a simple, model-agnostic, sample-efficient stability certification algorithm (SCA) that yields non-trivial and interpretable guarantees for any attribution method. Moreover, we show that mild smoothing achieves a more favorable trade-off between accuracy and stability, avoiding the aggressive compromises made in prior certification methods. To explain this behavior, we use Boolean function analysis to derive a novel characterization of stability under smoothing. We evaluate SCA on vision and language tasks and demonstrate the effectiveness of soft stability in measuring the robustness of explanation methods.
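To make the idea of a sample-based, model-agnostic certificate concrete, here is a minimal sketch. It is not the paper's implementation. It assumes that soft stability is estimated as the fraction of random perturbations that leave the prediction on the masked input unchanged, where a perturbation reveals up to `radius` features beyond those selected by the attribution. It also assumes the sample count comes from a standard Hoeffding bound. The names `hoeffding_sample_size`, `estimate_stability_rate`, `predict`, `mask`, and `radius` are hypothetical, and the perturbation distribution (a uniform size, then a uniform subset of that size) is an illustrative choice rather than one taken from the abstract.

```python
import math
import random
from typing import Callable, Optional, Sequence


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Samples needed so an empirical mean of {0, 1} outcomes is within
    eps of its expectation with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_stability_rate(
    predict: Callable[[Sequence[float]], int],
    x: Sequence[float],
    mask: Sequence[bool],
    radius: int,
    eps: float = 0.1,
    delta: float = 0.1,
    rng: Optional[random.Random] = None,
) -> float:
    """Monte Carlo estimate of how often the prediction on the masked input
    survives revealing up to `radius` additional features (assumed setup)."""
    rng = rng or random.Random(0)
    n = hoeffding_sample_size(eps, delta)

    # Indices of features the attribution did NOT select.
    unselected = [i for i, keep in enumerate(mask) if not keep]

    # Reference prediction: only the selected features are visible.
    base = predict([xi if keep else 0.0 for xi, keep in zip(x, mask)])

    stable = 0
    for _ in range(n):
        # Illustrative distribution: uniform size, then a uniform subset.
        k = rng.randint(0, min(radius, len(unselected)))
        added = set(rng.sample(unselected, k))
        perturbed = [
            xi if (keep or i in added) else 0.0
            for i, (xi, keep) in enumerate(zip(x, mask))
        ]
        stable += int(predict(perturbed) == base)
    return stable / n


if __name__ == "__main__":
    # Toy classifier that only looks at feature 0, so revealing other
    # features can never change its output.
    toy = lambda v: int(v[0] > 0)
    print(estimate_stability_rate(toy, [1.0, 2.0, 3.0], [True, False, False], radius=2))
```

Because the estimator only calls `predict`, it applies to any classifier and any attribution method that produces a binary feature mask. Under the assumed Hoeffding bound, the number of model queries depends only on `eps` and `delta`, not on the input dimension.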