CL AI LG ML
Apr 19, 2025

Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Katie Matton, Robert Osazuwa Ness, John Guttag, Emre Kıcıman

arXiv:2504.14150v227.443 citations · h-index: 8 · Has Code · ICLR

Originality: Incremental advance

AI Analysis

The paper addresses trust and safety concerns for users who rely on LLM explanations, though the advance is incremental because it builds on existing interpretability work.

The paper tackles unfaithful explanations from large language models (LLMs), which can misrepresent the model's reasoning and lead to misuse. It introduces a method for measuring faithfulness based on concept-level causal effects. On a social bias task, the method uncovers explanations that hide the influence of social bias; on a medical question answering task, it uncovers explanations that make misleading claims about which pieces of evidence influenced the model's decisions.

Large language models (LLMs) are capable of generating plausible explanations of how they arrived at an answer to a question. However, these explanations can misrepresent the model's "reasoning" process, i.e., they can be unfaithful. This, in turn, can lead to over-trust and misuse. We introduce a new approach for measuring the faithfulness of LLM explanations. First, we provide a rigorous definition of faithfulness. Since LLM explanations mimic human explanations, they often reference high-level concepts in the input question that purportedly influenced the model. We define faithfulness in terms of the difference between the set of concepts that LLM explanations imply are influential and the set that truly are. Second, we present a novel method for estimating faithfulness that is based on: (1) using an auxiliary LLM to modify the values of concepts within model inputs to create realistic counterfactuals, and (2) using a Bayesian hierarchical model to quantify the causal effects of concepts at both the example- and dataset-level. Our experiments show that our method can be used to quantify and discover interpretable patterns of unfaithfulness. On a social bias task, we uncover cases where LLM explanations hide the influence of social bias. On a medical question answering task, we uncover cases where LLM explanations provide misleading claims about which pieces of evidence influenced the model's decisions.
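
The set-based definition in the abstract can be illustrated with a minimal Python sketch. It is not the authors' estimator: the Bayesian hierarchical model is swapped for a naive answer-change rate, the counterfactual answers are hard-coded rather than produced by an auxiliary LLM, and every name, threshold, and data value below (`change_rate`, `faithfulness_gap`, `effect_min`, `mention_min`, the toy concepts) is a hypothetical chosen for illustration.

```python
def change_rate(original, counterfactual):
    """Share of paired answers that differ after one concept is edited.

    A naive stand-in for a causal-effect estimate (assumption: the paper
    uses a Bayesian hierarchical model instead).
    """
    if len(original) != len(counterfactual) or not original:
        raise ValueError("need equal-length, non-empty answer lists")
    changed = sum(o != c for o, c in zip(original, counterfactual))
    return changed / len(original)


def faithfulness_gap(effects, mention_rates, effect_min=0.2, mention_min=0.5):
    """Compare truly influential concepts with those explanations cite.

    Both thresholds are arbitrary illustrative cut-offs.
    """
    influential = {c for c, e in effects.items() if e >= effect_min}
    implied = {c for c, m in mention_rates.items() if m >= mention_min}
    return {
        "hidden": influential - implied,       # influential, never cited
        "overclaimed": implied - influential,  # cited, little real effect
    }


# Hypothetical toy data: model answers to four questions before and after
# editing each concept in the question text.
original = ["A", "A", "B", "A"]
edited = {
    "applicant_gender": ["B", "A", "A", "A"],
    "years_experience": ["A", "A", "B", "A"],
}
effects = {c: change_rate(original, ans) for c, ans in edited.items()}

# Hypothetical share of explanations that cite each concept as a reason.
mention_rates = {"applicant_gender": 0.0, "years_experience": 1.0}

gap = faithfulness_gap(effects, mention_rates)
print(gap)
```

In this toy setup, a concept that changes answers but is never cited lands in `hidden`, which mirrors the social bias finding described above; a concept that is cited but has little measured effect lands in `overclaimed`.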
