CL, AI, LG, ML | Apr 19, 2025

Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

arXiv:2504.14150v2 | 41 citations | h-index: 8 | ICLR
Originality: Incremental advance
AI Analysis

The paper addresses trust and safety for users who rely on LLM explanations. The advance is incremental, since it builds on existing work in interpretability.

The paper tackles unfaithful explanations from large language models (LLMs), which can misrepresent the model's reasoning and lead to misuse. It introduces a method for measuring faithfulness based on concept-level causal effects. On a social bias task, the method uncovers explanations that hide the influence of social bias; on a medical question answering task, it uncovers explanations that make misleading claims about which pieces of evidence mattered.

Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.
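The comparison described in the abstract, between the concepts that truly influence the model and the concepts its explanations cite, can be illustrated with a small sketch. This is not the authors' implementation. It assumes plain averages in place of the Bayesian hierarchical model, total variation distance as the measure of a concept's causal effect, and a Pearson correlation as the faithfulness score. All function names, concept names, and numbers are hypothetical toy values.

```python
from math import sqrt


def total_variation(p, q):
    """Total variation distance between two answer distributions (dicts)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in labels)


def causal_effect(original, counterfactuals):
    """Mean shift in the answer distribution when one concept is edited.

    Assumption: a simple average stands in for the hierarchical model.
    """
    shifts = [total_variation(original, cf) for cf in counterfactuals]
    return sum(shifts) / len(shifts)


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(vx * vy)


def faithfulness(effects, mention_rates):
    """Agreement between true influence and explanation-implied influence."""
    concepts = sorted(effects)
    return pearson([effects[c] for c in concepts],
                   [mention_rates[c] for c in concepts])


# Hypothetical toy data: the model's answer distribution on one question,
# and its distribution after an auxiliary LLM edits each concept.
original = {"A": 0.8, "B": 0.2}
counterfactuals = {
    "gender": [{"A": 0.3, "B": 0.7}],
    "behavior": [{"A": 0.7, "B": 0.3}],
    "location": [{"A": 0.8, "B": 0.2}],
}
# Hypothetical fraction of explanations that cite each concept as influential.
mention_rates = {"gender": 0.0, "behavior": 1.0, "location": 0.1}

effects = {c: causal_effect(original, cfs) for c, cfs in counterfactuals.items()}
score = faithfulness(effects, mention_rates)
print(effects)
print(score)
```

In this toy case the concept with the largest effect on the answer (gender) is never cited in the explanations, so the score is negative. That is the kind of pattern the abstract describes as explanations hiding the influence of social bias.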

Code Implementations: 1 repo
