Backward Compatibility in Attributive Explanation and Enhanced Model Training Method
The work addresses explanation instability across model updates in real-world ML/AI applications, though the contribution is incremental because it builds on existing explanation methods.
The paper tackles the problem of model updates causing detrimental changes in feature attribution explanations. It introduces BCX, a metric for evaluating backward compatibility, and BCXR, a training method that improves agreement between the explanations of pre- and post-update models, and it reports superior trade-offs between predictive performance and BCX scores on eight real-world datasets.
Model updating is a crucial process in the operation of ML/AI systems. While updating a model generally enhances average prediction performance, it also significantly impacts the explanations of predictions. In real-world applications, even minor changes in explanations can have detrimental consequences. To tackle this issue, this paper introduces BCX, a quantitative metric that evaluates the backward compatibility of feature attribution explanations between pre- and post-update models. BCX uses practical agreement metrics to calculate the average agreement between the explanations of the pre- and post-update models, restricted to samples that both models predict correctly. In addition, we propose BCXR, a BCX-aware model training method built on surrogate losses that theoretically lower-bound the agreement scores. Furthermore, we present a universal variant of BCXR that improves all agreement metrics by using the L2 distance between the explanations of the two models. To validate our approach, we conducted experiments on eight real-world datasets, which show that BCXR achieves superior trade-offs between predictive performance and BCX scores.
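As a rough illustration of the metric described in the abstract, the sketch below computes a BCX-style score in plain Python. It is not the authors' implementation. The function names, the choice of top-k feature agreement as the "practical agreement metric", and the toy data are all assumptions made for this example. The last function shows one plausible form of the L2 term that the universal BCXR variant could penalize during training. The abstract does not specify the exact loss.

```python
from math import nan, sqrt

def top_k_features(attribution, k):
    """Indices of the k features with the largest absolute attribution."""
    ranked = sorted(range(len(attribution)), key=lambda i: -abs(attribution[i]))
    return set(ranked[:k])

def top_k_agreement(attr_old, attr_new, k):
    """Fraction of top-k features shared by two attribution vectors (assumed metric)."""
    return len(top_k_features(attr_old, k) & top_k_features(attr_new, k)) / k

def bcx_score(attrs_old, attrs_new, preds_old, preds_new, labels, k=2):
    """Average agreement over samples that BOTH models predict correctly."""
    scores = [
        top_k_agreement(a_old, a_new, k)
        for a_old, a_new, p_old, p_new, y
        in zip(attrs_old, attrs_new, preds_old, preds_new, labels)
        if p_old == y and p_new == y
    ]
    return sum(scores) / len(scores) if scores else nan

def l2_explanation_penalty(attrs_old, attrs_new):
    """Mean L2 distance between paired explanations (hypothetical regularizer)."""
    distances = [
        sqrt(sum((o - n) ** 2 for o, n in zip(a_old, a_new)))
        for a_old, a_new in zip(attrs_old, attrs_new)
    ]
    return sum(distances) / len(distances)

# Toy data (invented for illustration): three samples, four features.
attrs_old = [[0.9, 0.1, -0.5, 0.0], [0.2, 0.8, 0.1, 0.0], [0.4, 0.3, 0.2, 0.1]]
attrs_new = [[0.8, 0.0, -0.6, 0.1], [0.0, 0.1, 0.9, 0.7], [0.4, 0.3, 0.2, 0.1]]
labels    = [1, 0, 1]
preds_old = [1, 0, 0]   # the old model misclassifies the third sample
preds_new = [1, 0, 1]

print(bcx_score(attrs_old, attrs_new, preds_old, preds_new, labels, k=2))  # 0.5
```

In this toy case the third sample is excluded because the pre-update model misclassifies it. The first sample keeps the same top-2 features and the second shares none, so the score is 0.5. Restricting the average to jointly correct samples matches the abstract's definition. Any other agreement metric could be substituted for `top_k_agreement` without changing the rest of the sketch.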