A Formal Approach to Explainability
This work provides a theoretical foundation for explainability. The contribution is incremental: it builds on existing concepts and does not introduce new methods.
The paper tackles the problem of formalizing explainability in machine learning. It defines properties of explanation-generating functions and studies how these functions relate to the layers of a model, showing that if one layer's activations are consistent with an explanation, then so are the activations of all subsequent layers.
We regard explanations as a blending of the input sample and the model's output, and we offer a few definitions that capture various desired properties of the function that generates these explanations. We study the links among these properties, as well as the links between explanation-generating functions and the intermediate representations of learned models. We are able to show, for example, that if the activations of a given layer are consistent with an explanation, then so are the activations of all subsequent layers. In addition, we study the intersection and union of explanations as a way to construct new explanations.
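
The layer-consistency claim and the set operations on explanations can be illustrated with a small sketch. The abstract does not state the formal definitions, so everything below is an assumption made for illustration: an explanation is modelled as a binary mask over input features, and a layer is called "consistent" with an explanation when its activations are unchanged if the input is replaced by its masked version. The function names (`apply_mask`, `forward`, `intersection`, `union`) and the toy weights are hypothetical and do not come from the paper.

```python
# Hypothetical illustration only: explanations are modelled as binary
# feature masks, and "consistency" as equality of activations on the
# original and the masked input. These are assumptions, not the paper's
# definitions.

def apply_mask(x, mask, baseline=0.0):
    """Keep features where mask is 1; replace the others with a baseline."""
    return [xi if m else baseline for xi, m in zip(x, mask)]

def linear_relu(weights, v):
    """One dense layer (no bias) followed by ReLU."""
    return [max(0.0, sum(w * vi for w, vi in zip(row, v))) for row in weights]

def forward(layers, x):
    """Return the activations of every layer, in order."""
    activations = []
    for weights in layers:
        x = linear_relu(weights, x)
        activations.append(x)
    return activations

def intersection(m1, m2):
    """Features selected by both explanations."""
    return [a & b for a, b in zip(m1, m2)]

def union(m1, m2):
    """Features selected by at least one explanation."""
    return [a | b for a, b in zip(m1, m2)]

# Toy network whose first layer ignores the third input feature, so
# masking that feature leaves the first layer's activations unchanged.
layers = [
    [[1.0, -0.5, 0.0], [0.5, 2.0, 0.0]],
    [[1.0, 1.0], [-1.0, 0.5]],
    [[0.3, 0.7]],
]
x = [1.0, 2.0, 5.0]
mask = [1, 1, 0]

full = forward(layers, x)
masked = forward(layers, apply_mask(x, mask))

# Consistency at the first layer...
first_consistent = full[0] == masked[0]
# ...carries over to later layers, since each depends on the input only
# through the activations of the layer before it.
later_consistent = all(f == m for f, m in zip(full[1:], masked[1:]))
```

Under these assumptions the propagation is immediate: every later layer is a function of the earlier layer's activations, so equal activations at one layer force equal activations downstream. The paper's own notion of consistency may be more general than exact equality.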