GNN Explanations that do not Explain and How to find Them
This paper addresses the problem of unreliable explanations in graph neural networks, a concern for both researchers and practitioners, and highlights a failure mode that can hide misuse of sensitive attributes. The contribution is incremental in that it builds on prior work on explanation suboptimality.
The paper identifies a critical failure in self-explainable graph neural networks (SE-GNNs): explanations can be unrelated to how the model infers labels. It shows that many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and that most faithfulness metrics can fail to detect them. It introduces a new faithfulness metric that reliably marks such explanations as unfaithful in both malicious and natural settings.
Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model's inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.
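To make the failure mode concrete, the following is a minimal toy sketch of how an explanation can be unrelated to the way labels are inferred. It is not the construction, the models, or the metric from the paper. Every name in it (`extract_explanation`, `classify_from_explanation`, `prediction_changes_outside_explanation`, the `"sensitive"` feature, and the convention that nodes 0 and 1 act as anchor nodes) is a hypothetical chosen for illustration. The extractor reads a sensitive attribute across the whole graph but reports only an anchor node that encodes the label, so the pipeline is perfectly accurate on this toy task while the explanation never shows the attribute it relies on.

```python
from typing import Dict, List, Tuple

# Toy graph: one feature dict per node; edges are omitted for brevity (assumption).
Graph = List[Dict[str, int]]


def extract_explanation(graph: Graph) -> List[int]:
    """Degenerate extractor (hypothetical): inspects the sensitive attribute of
    every node, but returns only an anchor node index (0 or 1) encoding the label."""
    uses_sensitive = any(node["sensitive"] == 1 for node in graph)
    return [1] if uses_sensitive else [0]


def classify_from_explanation(explanation: List[int]) -> int:
    """Classifier that only decodes which anchor node was selected."""
    return 1 if explanation == [1] else 0


def predict(graph: Graph) -> Tuple[int, List[int]]:
    """Full self-explainable pipeline: returns (label, explanation node indices)."""
    explanation = extract_explanation(graph)
    return classify_from_explanation(explanation), explanation


def prediction_changes_outside_explanation(graph: Graph) -> bool:
    """Generic perturbation check (not the paper's metric): flip the sensitive
    attribute of each node outside the explanation and report whether any
    single flip changes the predicted label."""
    label, explanation = predict(graph)
    for i in range(len(graph)):
        if i in explanation:
            continue
        perturbed = [dict(node) for node in graph]  # copy so the input is untouched
        perturbed[i]["sensitive"] = 1 - perturbed[i]["sensitive"]
        if predict(perturbed)[0] != label:
            return True
    return False


# Two toy graphs whose ground-truth label is "some node has sensitive == 1".
g_pos: Graph = [{"sensitive": 0}, {"sensitive": 0}, {"sensitive": 1}]
g_neg: Graph = [{"sensitive": 0}, {"sensitive": 0}, {"sensitive": 0}]

for g in (g_pos, g_neg):
    label, explanation = predict(g)
    print(label, explanation, prediction_changes_outside_explanation(g))
```

In this sketch the reported explanation is always a single anchor node whose own sensitive attribute is 0, yet changing a node outside the explanation flips the prediction. That gap between what is highlighted and what drives the label is the kind of behaviour an auditing metric would need to flag.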