A Unified Taylor Framework for Revisiting Attribution Methods
This work provides a theoretical foundation for understanding and improving attribution methods, which is important for AI interpretability. The contribution is incremental, however: it builds upon existing methods rather than introducing a new paradigm.
The authors address the lack of a theoretical framework for unifying and analyzing attribution methods in machine learning. They propose a Taylor attribution framework that reformulates seven mainstream methods, and they validate it empirically on real-world datasets by showing a positive correlation between attribution performance and adherence to established principles.
Attribution methods have been developed to understand the decision-making process of machine learning models, especially deep neural networks, by assigning importance scores to individual features. Existing attribution methods are often built upon empirical intuitions and heuristics. There is still no general theoretical framework that can both unify these attribution methods and reveal their rationales, fidelity, and limitations. To bridge this gap, we propose a Taylor attribution framework and reformulate seven mainstream attribution methods within it. Based on these reformulations, we analyze the attribution methods in terms of rationale, fidelity, and limitations. Moreover, we establish three principles for a good attribution in the Taylor attribution framework: low approximation error, correct contribution assignment, and unbiased baseline selection. Finally, we empirically validate the Taylor reformulations and, by benchmarking on real-world datasets, reveal a positive correlation between attribution performance and the number of principles an attribution method follows.
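
To make the idea of a Taylor-style attribution concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes a hypothetical toy model `f`, a first-order expansion taken around the input `x`, and an all-zero baseline; under those assumptions each feature's score is its partial derivative times its offset from the baseline, which reduces to gradient times input. The function names and the toy model are illustrative assumptions only.

```python
def f(x):
    # Hypothetical toy model: one interaction term plus one linear term.
    return x[0] * x[1] + 2.0 * x[2]


def grad_f(x):
    # Analytic gradient of the toy model above.
    return [x[1], x[0], 2.0]


def first_order_taylor_attribution(x, baseline):
    # Assumed form: a_i = df/dx_i(x) * (x_i - baseline_i),
    # i.e. a first-order expansion around the input x.
    g = grad_f(x)
    return [gi * (xi - bi) for gi, xi, bi in zip(g, x, baseline)]


def approximation_error(x, baseline):
    # Gap between the true output change f(x) - f(baseline)
    # and the sum of the first-order attributions.
    attributions = first_order_taylor_attribution(x, baseline)
    return (f(x) - f(baseline)) - sum(attributions)


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumed zero baseline

attributions = first_order_taylor_attribution(x, baseline)
error = approximation_error(x, baseline)

print(attributions)  # [2.0, 2.0, 6.0]
print(error)         # -2.0
```

In this toy case the attributions sum to 10 while the output changes by only 8: the first-order terms count the interaction `x[0] * x[1]` twice. That nonzero residual is the kind of gap the "low approximation error" principle is concerned with, and the choice of `baseline` directly changes every score, which is why baseline selection appears as a separate principle.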