Uncovering the Structure of Explanation Quality with Spectral Analysis
This work addresses the need for more reliable evaluation techniques for explanations in high-stakes machine learning domains, though it appears incremental as it builds on existing metrics.
The paper tackles the unclear practical applicability of existing explanation quality metrics by proposing a spectral analysis framework that systematically captures the multifaceted properties of explanation techniques. The analysis uncovers two distinct factors, stability and target sensitivity, and experiments on MNIST and ImageNet show that popular evaluation techniques partially capture the trade-offs between them.
As machine learning models are increasingly considered for high-stakes domains, effective explanation methods are crucial to ensure that their prediction strategies are transparent to the user. Over the years, numerous metrics have been proposed to assess the quality of explanations. However, their practical applicability remains unclear, in particular due to a limited understanding of which specific aspects each metric rewards. In this paper, we propose a new framework based on spectral analysis of explanation outcomes to systematically capture the multifaceted properties of different explanation techniques. Our analysis uncovers two distinct factors of explanation quality, namely stability and target sensitivity, that can be directly observed through spectral decomposition. Experiments on both MNIST and ImageNet show that popular evaluation techniques (e.g., pixel-flipping, entropy) partially capture the trade-offs between these factors. Overall, our framework provides a foundational basis for understanding explanation quality, guiding the development of more reliable techniques for evaluating explanations.
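For readers unfamiliar with the two evaluation techniques named above, the following is a minimal sketch of how pixel-flipping and entropy are commonly computed for a single attribution vector. It is not the paper's experimental setup. The toy linear model, the gradient-times-input attributions, the zero baseline, and the function names `attribution_entropy` and `pixel_flipping_curve` are all assumptions introduced here for illustration.

```python
import math


def attribution_entropy(attributions):
    """Shannon entropy (in nats) of the normalized absolute attributions.

    Low entropy means relevance is concentrated on few features.
    """
    magnitudes = [abs(a) for a in attributions]
    total = sum(magnitudes)
    if total == 0:
        raise ValueError("attributions are all zero")
    probs = [m / total for m in magnitudes]
    return -sum(p * math.log(p) for p in probs if p > 0)


def pixel_flipping_curve(model, x, attributions, baseline=0.0):
    """Model outputs as features are replaced by `baseline`, most relevant first.

    A faster drop suggests the explanation ranks influential features highly.
    """
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    flipped = list(x)
    curve = [model(flipped)]
    for i in order:
        flipped[i] = baseline
        curve.append(model(flipped))
    return curve


# Hypothetical toy setup: a linear model on a 4-feature input.
weights = [0.5, 2.0, 0.0, 1.0]


def toy_model(x):
    return sum(w * xi for w, xi in zip(weights, x))


x = [1.0, 1.0, 1.0, 1.0]
# Gradient-times-input attributions, which are exact for a linear model.
attributions = [w * xi for w, xi in zip(weights, x)]

curve = pixel_flipping_curve(toy_model, x, attributions)
entropy = attribution_entropy(attributions)
print(curve)
print(entropy)
```

In this toy case the flipping curve falls steeply at first because the highest-ranked feature carries the largest weight, and the entropy lies below its maximum of log(4) because relevance is unevenly spread. Both scores summarize a single explanation with one number or curve, which is consistent with the abstract's observation that such metrics only partially capture the underlying factors.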