CL · AI · LG · Feb 7, 2024

FaithLM: Towards Faithful Explanations for Large Language Models

arXiv:2402.04678v4 · 9 citations · h-index: 26
Originality: Highly original
AI Analysis

This work addresses the unreliability of LLM explanations, an issue central to trustworthy AI applications. The contribution is incremental, since it builds on existing self-explanation methods.

The authors tackle unfaithful explanations from large language models with FaithLM, a model-agnostic framework that formalizes faithfulness as an intervention property and improves it through iterative optimization. Across multiple datasets and models, FaithLM increases faithfulness and produces explanations that align more closely with human rationales.

Large language models (LLMs) increasingly produce natural language explanations, yet these explanations often lack faithfulness: they do not reliably reflect the evidence the model uses to reach its decision. We introduce FaithLM, a model-agnostic framework that evaluates and improves the faithfulness of LLM explanations without token masking or task-specific heuristics. FaithLM formalizes explanation faithfulness as an intervention property: a faithful explanation should yield a prediction shift when its content is contradicted. Theoretical analysis shows that the resulting contrary-hint score is a sound and discriminative estimator of faithfulness. Building on this principle, FaithLM iteratively refines both the elicitation prompt and the explanation to maximize the measured score. Experiments on three multi-domain datasets and multiple LLM backbones demonstrate that FaithLM consistently increases faithfulness and produces explanations more aligned with human rationales than strong self-explanation baselines. These findings highlight that intervention-based evaluation, coupled with iterative optimization, provides a principled route toward faithful and reliable LLM explanations.
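The intervention idea and the refinement loop can be illustrated with a minimal sketch, under stated assumptions. Here the contrary-hint score is reduced to a binary check: does appending a hint that contradicts the explanation change the model's prediction? Iterative refinement is reduced to a greedy loop that proposes new elicitation prompts from a scored history and keeps the best one. All names (`contrary_hint`, `contrary_hint_score`, `prompt_score`, `refine`, the hint template, and the toy model, explainer, and proposer) are hypothetical and are not FaithLM's actual API. The paper's estimator and optimizer may differ, and this sketch omits refinement of individual explanations.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions for this sketch, not FaithLM's API).
Predictor = Callable[[str], str]                     # prompt -> predicted label
Explainer = Callable[[str, str], str]                # (elicitation prompt, question) -> explanation
Proposer = Callable[[List[Tuple[str, float]]], str]  # scored history -> new prompt


def contrary_hint(explanation: str) -> str:
    # Assumed template: assert that the explanation's content does not hold.
    return f"Note: it is NOT true that {explanation}"


def contrary_hint_score(predict: Predictor, question: str, explanation: str) -> float:
    """Return 1.0 if contradicting the explanation changes the prediction, else 0.0."""
    before = predict(question)
    after = predict(question + "\n" + contrary_hint(explanation))
    return float(before != after)


def prompt_score(predict: Predictor, explain: Explainer,
                 questions: Sequence[str], prompt: str) -> float:
    """Average contrary-hint score of the explanations that one prompt elicits."""
    return mean(contrary_hint_score(predict, q, explain(prompt, q)) for q in questions)


def refine(predict: Predictor, explain: Explainer, propose: Proposer,
           questions: Sequence[str], seed_prompt: str, rounds: int = 3) -> Tuple[str, float]:
    """Greedy loop: propose a new elicitation prompt from the scored history,
    score it, and return the best (prompt, score) pair seen."""
    history = [(seed_prompt, prompt_score(predict, explain, questions, seed_prompt))]
    for _ in range(rounds):
        candidate = propose(history)
        history.append((candidate, prompt_score(predict, explain, questions, candidate)))
    return max(history, key=lambda pair: pair[1])


# --- Toy stand-ins so the sketch runs without an LLM (all hypothetical) ---
def toy_predict(prompt: str) -> str:
    # "Model" that answers "wet" when rain is mentioned, unless rain is contradicted.
    if "NOT true that it rained" in prompt:
        return "dry"
    return "wet" if "rain" in prompt else "dry"


def toy_explain(prompt: str, question: str) -> str:
    # Prompts asking for evidence elicit the decisive fact; others elicit filler.
    return "it rained" if "evidence" in prompt else "the grass is green"


def toy_propose(history: List[Tuple[str, float]]) -> str:
    # A real proposer would ask an LLM to improve on the history; we cycle a list.
    pool = ["Explain your answer.", "Cite the evidence behind your answer."]
    return pool[(len(history) - 1) % len(pool)]


if __name__ == "__main__":
    qs = ["It rained last night. Is the grass wet or dry?"]
    print(refine(toy_predict, toy_explain, toy_propose, qs, "Why?", rounds=2))
```

Running the script prints `('Cite the evidence behind your answer.', 1.0)`. Contradicting the decisive fact ("it rained") flips the toy model's answer, but contradicting filler ("the grass is green") does not, so the evidence-seeking prompt scores highest. A probability-shift score could replace the binary flip check without changing the loop.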

