A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations
This work addresses the need for transparent AI in high-stakes decisions by improving explanation consistency. The contribution is incremental, since it builds on existing methods such as weight of evidence.
The paper tackles faithfulness in free-text explanations from language models by measuring Prediction-EXplanation (PEX) consistency. It finds that more than 62% of explanations lack consistency, and that direct preference optimization improves consistency by up to 292.3%.
Faithful free-text explanations are important to ensure transparency in high-stakes AI decision-making contexts, but they are challenging for language models to generate and for humans to assess. In this paper, we present a measure for Prediction-EXplanation (PEX) consistency, by extending the concept of weight of evidence. This measure quantifies how much a free-text explanation supports or opposes a prediction, serving as an important aspect of explanation faithfulness. Our analysis reveals that more than 62% of explanations generated by large language models lack this consistency. We show that applying direct preference optimization improves the consistency of generated explanations across three model families, with improvements ranging from 43.1% to 292.3%. Furthermore, we demonstrate that optimizing this consistency measure can improve explanation faithfulness by up to 9.7%.
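To make the idea of "supports or opposes a prediction" concrete, here is a minimal sketch of the classical weight of evidence, defined as the log posterior odds of a hypothesis minus its log prior odds. The abstract does not spell out how the paper extends this quantity to free text, so everything specific here is an assumption for illustration: the function names are hypothetical, and the two probabilities are assumed to come from some model that scores the prediction with and without the explanation in context.

```python
import math


def log_odds(p: float) -> float:
    """Return the log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def weight_of_evidence(p_with_explanation: float, p_without_explanation: float) -> float:
    """Classical weight of evidence: log posterior odds minus log prior odds.

    Assumption (not from the paper): p_with_explanation is the probability
    of the prediction given the explanation, and p_without_explanation is
    the probability of the same prediction without it.
    """
    return log_odds(p_with_explanation) - log_odds(p_without_explanation)


def supports_prediction(woe: float, threshold: float = 0.0) -> bool:
    """Hypothetical rule: the explanation supports the prediction if woe > threshold."""
    return woe > threshold


# An explanation that raises the prediction's probability from 0.5 to 0.9
# has positive weight of evidence; one that lowers it to 0.2 has negative.
supporting = weight_of_evidence(0.9, 0.5)
opposing = weight_of_evidence(0.2, 0.5)
```

Under this sketch, a positive value means the explanation makes the prediction more likely (it supports it), a negative value means it opposes it, and zero means the explanation carries no evidence either way. The zero threshold is an illustrative choice, not the paper's criterion for consistency.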