CL | Apr 14, 2024

When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models

arXiv:2404.09129v1 | 40 citations | h-index: 10 | Has Code | NAACL-HLT
Originality: Incremental advance
AI Analysis

This work examines when self-reflection helps or hurts LLMs and offers researchers and practitioners guidelines for applying it. The advance is incremental, refining existing methods.

The paper investigates the effectiveness of self-reflective prompting in Large Language Models without external feedback, finding that it improves performance on TruthfulQA but harms it on HotpotQA, with benefits most pronounced when models are initially less accurate and questions are more difficult.

Recent studies suggest that self-reflective prompting can significantly enhance the reasoning capabilities of Large Language Models (LLMs). However, the use of external feedback as a stop criterion raises doubts about the true extent of LLMs' ability to emulate human-like self-reflection. In this paper, we set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting show a split: while self-reflection enhances performance in TruthfulQA, it adversely affects results in HotpotQA. We conduct follow-up analyses to clarify the contributing factors in these patterns, and find that the influence of self-reflection is impacted both by reliability of accuracy in models' initial responses, and by overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher. We also find that self-reflection reduces tendency toward majority voting. Based on our findings, we propose guidelines for decisions on when to implement self-reflection. We release the codebase for reproducing our experiments at https://github.com/yanhong-lbh/LLM-SelfReflection-Eval.
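The evaluation setting described in the abstract, reflection with no external feedback, can be illustrated with a minimal sketch. This is not the authors' implementation and is not taken from the released codebase: the prompt wording, the function names (`self_reflect`, `majority_vote`, `stub_model`), and the choice of a fixed number of rounds as the stop criterion are all assumptions made for illustration. The point it shows is that the loop never consults a ground-truth label or an external checker; the model only sees the question and its own previous answer.

```python
from collections import Counter
from typing import Callable, List


def self_reflect(question: str, ask: Callable[[str], str], rounds: int = 2) -> List[str]:
    """Answer once, then revise a fixed number of times.

    No external feedback is used: the loop stops after `rounds` revisions,
    not when some oracle says the answer is correct.
    The prompt text here is hypothetical, not the paper's.
    """
    answer = ask(f"Question: {question}\nAnswer:")
    history = [answer]
    for _ in range(rounds):
        prompt = (
            f"Question: {question}\n"
            f"Your previous answer: {answer}\n"
            "Reflect on possible mistakes, then give a final answer:"
        )
        answer = ask(prompt)
        history.append(answer)
    return history


def majority_vote(answers: List[str]) -> str:
    """Baseline for comparison: pick the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, so the sketch runs offline.

    It changes its answer whenever it is asked to reflect, mimicking the
    failure mode where reflection overturns an initially correct answer.
    """
    return "Lyon" if "previous answer" in prompt else "Paris"
```

Calling `self_reflect("What is the capital of France?", stub_model)` with this stub yields an initially correct answer that is revised away, which is the kind of degradation the abstract reports for HotpotQA. In practice `ask` would wrap a real model call.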

Code Implementations: 1 repo