CL, Jun 9, 2025

Towards Large Language Models with Self-Consistent Natural Language Explanations

Sahar Admoni, Ofra Amir, Assaf Hallak, Yftah Ziser

arXiv:2506.07523v28.32 citations, h-index: 13

Originality: Incremental advance

AI Analysis

This paper addresses untrustworthy explanations from LLMs, a concern for users who rely on those explanations for interpretability. The advance is incremental because it builds on existing methods such as DPO.

The paper tackles inconsistent post-hoc explanations from large language models (LLMs). It introduces a large-scale benchmark (PSCB) and shows that the standard metric fails to distinguish explanation quality. The authors then propose an alternative metric and use it to fine-tune LLMs with DPO, achieving significantly better alignment between explanations and decision-relevant features.

Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet, studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly due to the high cost of estimating feature importance, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB) - a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions. We also show that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.
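As a rough illustration of the fine-tuning step described in the abstract, the Python sketch below computes the standard DPO loss for a single preference pair and includes a hypothetical helper that picks a preferred and a dispreferred explanation from candidates ranked by a consistency score. The function names `dpo_loss` and `build_preference_pair`, the example scores, and the default `beta=0.1` are assumptions made for illustration. The abstract does not specify how the authors define their metric or construct preference pairs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model. beta=0.1 is an assumed default.
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_preference_pair(explanations, scores):
    """Hypothetical helper: return (chosen, rejected) explanations.

    `scores` stands in for a per-explanation consistency score; how the
    paper actually scores explanations is not stated in the abstract.
    """
    ranked = sorted(zip(scores, explanations), key=lambda pair: pair[0])
    return ranked[-1][1], ranked[0][1]


if __name__ == "__main__":
    chosen, rejected = build_preference_pair(
        ["explanation A", "explanation B", "explanation C"],
        [0.2, 0.9, 0.5],  # made-up scores, for illustration only
    )
    print(chosen, "|", rejected)
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen explanation's likelihood relative to the rejected one.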
