AI CL · May 23, 2024

Dissociation of Faithful and Unfaithful Reasoning in LLMs

Evelyn Yee, Alice Li, Chenyu Tang, Yeon Ho Jung, Ramamohan Paturi, Leon Bergen

arXiv:2405.15092v220.925 citations · h-index: 30 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of unreliable reasoning in LLMs, with the goal of improving model interpretability and trustworthiness. Its contribution is incremental, since it analyzes existing mechanisms.

The study investigates how large language models (LLMs) recover from errors in Chain of Thought reasoning. It finds evidence of unfaithful reasoning, in which models produce correct answers despite invalid reasoning, and shows that recovery depends on how obvious the error is and on how much contextual evidence supports the correct answer.

Large language models (LLMs) often improve their performance in downstream tasks when they generate Chain of Thought reasoning text before producing an answer. We investigate how LLMs recover from errors in Chain of Thought. Through analysis of error recovery behaviors, we find evidence for unfaithfulness in Chain of Thought, which occurs when models arrive at the correct answer despite invalid reasoning text. We identify factors that shift LLM recovery behavior: LLMs recover more frequently from obvious errors and in contexts that provide more evidence for the correct answer. Critically, these factors have divergent effects on faithful and unfaithful recoveries. Our results indicate that there are distinct mechanisms driving faithful and unfaithful error recoveries. Selective targeting of these mechanisms may be able to drive down the rate of unfaithful reasoning and improve model interpretability.
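
The distinction the abstract draws between faithful and unfaithful recoveries can be illustrated with a small tallying sketch. This is not the authors' code: the `Trial` schema, its two boolean fields, and the label names are hypothetical, and the sketch assumes that a recovery counts as faithful only when the reasoning text visibly corrects the error, and as unfaithful when the final answer is correct but the text never fixes it.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = ("faithful_recovery", "unfaithful_recovery", "no_recovery")


@dataclass
class Trial:
    """One Chain of Thought transcript containing a known error (hypothetical schema)."""
    answer_correct: bool           # final answer matches the reference answer
    error_corrected_in_text: bool  # reasoning text visibly fixes the error


def classify(trial: Trial) -> str:
    """Assign one of the three labels under the assumptions stated above."""
    if not trial.answer_correct:
        return "no_recovery"
    if trial.error_corrected_in_text:
        return "faithful_recovery"
    # Correct answer, but the reasoning text never repairs the error.
    return "unfaithful_recovery"


def recovery_rates(trials: list[Trial]) -> dict[str, float]:
    """Return the fraction of trials that fall under each label."""
    counts = Counter(classify(t) for t in trials)
    total = len(trials)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Made-up example trials, for illustration only.
    trials = [
        Trial(answer_correct=True, error_corrected_in_text=True),
        Trial(answer_correct=True, error_corrected_in_text=False),
        Trial(answer_correct=False, error_corrected_in_text=False),
        Trial(answer_correct=True, error_corrected_in_text=False),
    ]
    print(recovery_rates(trials))
```

Tracking the two recovery rates separately is what makes it possible to check whether a factor such as error obviousness moves them in different directions, which is the kind of divergence the abstract reports.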
