Attention cannot be an Explanation
This work challenges a common assumption in interpretable AI by showing that attention-based explanations are ineffective at improving human trust, which is crucial for deploying black-box models in real-world applications.
The paper investigates whether attention weights can serve as explanations that increase human trust in, and reliance on, deep neural networks, even in tasks where the weights correlate with feature importance. Its human studies find that attention cannot be used as an explanation.
Attention-based explanations (viz. saliency maps), by providing interpretability to black-box models such as deep neural networks, are assumed to improve human trust in and reliance on the underlying models. Recently, it has been shown that attention weights are frequently uncorrelated with gradient-based measures of feature importance. Motivated by this, we ask a follow-up question: "Assuming that we only consider the tasks where attention weights correlate well with feature importance, how effective are these attention-based explanations in increasing human trust in and reliance on the underlying models?" In other words, can we use attention as an explanation? We perform extensive human-study experiments that aim to qualitatively and quantitatively assess the degree to which attention-based explanations are suitable for increasing human trust and reliance. Our experimental results show that attention cannot be used as an explanation.
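The premise of the follow-up question, that attention weights may or may not correlate with gradient-based feature importance, can be made concrete with a small sketch. The code below is an illustration only and is not the procedure used in the paper: the one-layer attention model (`toy_model`), its random parameters `w` and `v`, the finite-difference gradient, and the choice of Kendall's tau as the rank correlation are all assumptions introduced here. It does not touch the paper's actual contribution, the human studies, which cannot be reproduced in code.

```python
import math
import random


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(tokens, w, v):
    """Hypothetical one-layer attention model: returns (prediction, attention weights)."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in tokens]
    alpha = softmax(scores)
    dim = len(w)
    # Attention-weighted average of the token vectors.
    context = [sum(a * x[j] for a, x in zip(alpha, tokens)) for j in range(dim)]
    logit = sum(vj * cj for vj, cj in zip(v, context))
    return 1.0 / (1.0 + math.exp(-logit)), alpha


def gradient_importance(tokens, w, v, eps=1e-5):
    """Per-token L1 norm of a central finite-difference gradient of the prediction."""
    importance = []
    for i, x in enumerate(tokens):
        total = 0.0
        for j in range(len(x)):
            plus = [list(t) for t in tokens]
            minus = [list(t) for t in tokens]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad = (toy_model(plus, w, v)[0] - toy_model(minus, w, v)[0]) / (2 * eps)
            total += abs(grad)
        importance.append(total)
    return importance


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    n = len(a)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Random toy inputs and parameters (assumed values, fixed seed for repeatability).
rng = random.Random(0)
dim, n_tokens = 4, 6
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
w = [rng.gauss(0, 1) for _ in range(dim)]
v = [rng.gauss(0, 1) for _ in range(dim)]

_, alpha = toy_model(tokens, w, v)
tau = kendall_tau(alpha, gradient_importance(tokens, w, v))
print(f"Kendall tau between attention and gradient importance: {tau:.2f}")
```

A tau near 1 would mean the two rankings of tokens agree, and a tau near 0 would mean they are unrelated. The question posed in the abstract starts where such a check ends: even for tasks where this agreement is high, it remains to be tested with people whether showing the attention weights increases their trust and reliance.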