Towards a Unified Framework for Evaluating Explanations
This work addresses the challenge of standardizing evaluation methods for interpretability, a need shared by researchers and practitioners developing explainable AI systems. Its contribution is incremental in that it builds on existing evaluation criteria.
The paper tackles the problem of evaluating interpretability in machine learning by reviewing how the ML and HCI communities assess explanations, proposing steps towards a unified framework based on criteria such as faithfulness and intelligibility, and illustrating these criteria with examples from an ongoing study of an interpretable neural network.
The challenge of creating interpretable models has been taken up by two main research communities: ML researchers, who have primarily focused on lower-level explainability methods that suit the needs of engineers, and HCI researchers, who have more heavily emphasized user-centered approaches, often based on participatory design methods. This paper reviews how these communities have evaluated interpretability, identifying overlaps and semantic misalignments. We propose moving towards a unified framework of evaluation criteria and lay the groundwork for such a framework by articulating the relationships between existing criteria. We argue that explanations serve as mediators between models and stakeholders, whether for intrinsically interpretable models or opaque black-box models analyzed via post-hoc techniques. We further argue that useful explanations require both faithfulness and intelligibility. Explanation plausibility is a prerequisite for intelligibility, while stability is a prerequisite for explanation faithfulness. We illustrate these criteria, as well as specific evaluation methods, using examples from an ongoing study of an interpretable neural network for predicting a particular learner behavior.