CL | AI | LG | Jan 15, 2024

Are self-explanations from Large Language Models faithful?

Andreas Madsen, Sarath Chandar, Siva Reddy

MILA

arXiv:2401.07927v427.6118 citations | h-index: 43 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses the risk of unsupported confidence in LLMs caused by misleading self-explanations. It is an incremental advance in measuring interpretability-faithfulness.

The paper addresses how to measure whether self-explanations from Large Language Models (LLMs) faithfully reflect model behavior. It proposes self-consistency checks as the measurement method and finds that faithfulness varies by explanation type, model, and task. In sentiment classification, for example, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.

Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.
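The self-consistency idea in the abstract (if the model says certain words are important, it should not be able to make the same prediction without them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `[REDACTED]` mask token, the "prediction must change" criterion, and the keyword-based toy classifier standing in for an LLM are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List

MASK = "[REDACTED]"  # assumed mask token, not taken from the paper
PUNCT = ".,!?;:"

def redact(text: str, important_words: Iterable[str], mask: str = MASK) -> str:
    """Replace each whitespace-separated token that matches an important word."""
    targets = {w.lower() for w in important_words}
    tokens = text.split()
    return " ".join(
        mask if t.strip(PUNCT).lower() in targets else t for t in tokens
    )

def is_self_consistent(
    classify: Callable[[str], str],
    explain: Callable[[str, str], List[str]],
    text: str,
) -> bool:
    """Return True if removing the words the model called important
    changes its prediction (assumed criterion for a faithful explanation)."""
    original = classify(text)
    important = explain(text, original)
    return classify(redact(text, important)) != original

# --- Hypothetical stand-ins for an LLM, used only to make the sketch runnable ---
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "awful"}

def toy_classify(text: str) -> str:
    """Keyword classifier: predicts 'unknown' when no sentiment word is present."""
    words = {t.strip(PUNCT).lower() for t in text.split()}
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "unknown"

def faithful_explain(text: str, prediction: str) -> List[str]:
    """Reports the words the toy classifier actually relies on."""
    lexicon = POSITIVE | NEGATIVE
    return [t.strip(PUNCT) for t in text.split() if t.strip(PUNCT).lower() in lexicon]

def unfaithful_explain(text: str, prediction: str) -> List[str]:
    """Reports an irrelevant word (the first token) as important."""
    return text.split()[:1]

if __name__ == "__main__":
    review = "The movie was great"
    print(is_self_consistent(toy_classify, faithful_explain, review))
    print(is_self_consistent(toy_classify, unfaithful_explain, review))
```

With a real model, `classify` and `explain` would both be prompts sent to the same LLM through its inference API, which is why the check needs no access to model internals.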
