CL, Jul 19, 2024

Evaluating the Reliability of Self-Explanations in Large Language Models

arXiv:2407.14487v24 citations
h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses the problem of unreliable AI explanations for users who need interpretability, though the advance is incremental: it proposes a task-tailored prompting approach.

The paper investigates the reliability of self-explanations from large language models (LLMs) on classification tasks. It finds that these explanations correlate with human judgment but do not fully capture the model's decision process, and that counterfactual explanations can bridge this gap because they are faithful and verifiable.

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
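One way to read "easy-to-verify" is that a counterfactual can be checked mechanically: feed the edited text back to the same classifier and see whether the label actually changes. The sketch below illustrates that idea only. It is not the paper's code; `toy_classify` is a hypothetical keyword classifier standing in for an LLM, and the word-level similarity score is an assumed proxy for how small the edit is.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Union


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for an LLM classifier: counts sentiment keywords."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score >= 0 else "negative"


def check_counterfactual(
    classify: Callable[[str], str],
    original: str,
    counterfactual: str,
) -> Dict[str, Union[bool, float, str]]:
    """Re-run the classifier on the counterfactual and compare labels.

    A counterfactual is treated as valid (an assumption of this sketch)
    when the predicted label differs from the label of the original text.
    """
    original_label = classify(original)
    new_label = classify(counterfactual)
    # Word-level similarity in [0, 1]; higher means a smaller edit.
    similarity = SequenceMatcher(
        None, original.split(), counterfactual.split()
    ).ratio()
    return {
        "original_label": original_label,
        "counterfactual_label": new_label,
        "valid": new_label != original_label,
        "similarity": similarity,
    }


if __name__ == "__main__":
    report = check_counterfactual(
        toy_classify,
        "the movie was great",
        "the movie was terrible",
    )
    print(report)
```

In practice `classify` would wrap a call to the model under study, so the check uses the same model that produced the counterfactual rather than a separate judge.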

Code Implementations: 1 repo