Towards Faithful Neural Network Intrinsic Interpretation with Shapley Additive Self-Attribution
This work addresses the need of researchers and practitioners for faithful interpretability in neural networks, though the contribution is incremental in that it builds on existing additive self-attribution frameworks.
The paper tackles the problem that self-interpreting neural networks either lack theoretical guarantees or compromise expressiveness. It proposes SASANet, whose self-attribution values are guaranteed to equal the Shapley values of its output. In experiments, SASANet surpasses existing self-attributing models in performance and rivals black-box models.
Self-interpreting neural networks have garnered significant research interest. Existing works in this domain often (1) lack a solid theoretical foundation ensuring genuine interpretability or (2) compromise model expressiveness. In response, we formulate a generic Additive Self-Attribution (ASA) framework. Observing that Shapley values are absent from existing Additive Self-Attribution methods, we propose the Shapley Additive Self-Attributing Neural Network (SASANet), with a theoretical guarantee that its self-attribution values equal the Shapley values of its output. Specifically, SASANet uses a marginal contribution-based sequential schema and internal distillation-based training strategies to model meaningful outputs for any number of features, resulting in an unapproximated, meaningful value function. Our experimental results indicate that SASANet surpasses existing self-attributing models in performance and rivals black-box models. Moreover, SASANet is shown to be more precise and efficient than post-hoc methods in interpreting its own predictions.
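To make the target quantity concrete, the sketch below computes exact Shapley values for a small value function by enumerating every feature subset and averaging marginal contributions. It is an illustration of the standard Shapley definition only, not SASANet's architecture or training procedure; the names `shapley_values` and `toy_value`, and the toy feature values, are hypothetical and chosen for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by subset enumeration (feasible only for small n)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(subset):
    """Hypothetical value function: an output defined for any feature subset."""
    x = {0: 1.0, 1: 2.0, 2: 3.0}  # assumed toy feature values
    total = sum(x[j] for j in subset)
    if 0 in subset and 1 in subset:
        total += x[0] * x[1]  # interaction between features 0 and 1
    return total


phi = shapley_values(toy_value, 3)
full = toy_value(frozenset({0, 1, 2}))
empty = toy_value(frozenset())
print(phi, sum(phi), full - empty)
```

In this toy case the attributions sum to the difference between the output on all features and the output on none (the efficiency property), which is the additive relationship a self-attributing model with Shapley-valued attributions must satisfy. The enumeration cost grows exponentially with the number of features, which is why post-hoc methods usually approximate these values.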