CLDec 19, 2024

Understanding the Dark Side of LLMs' Intrinsic Self-Correction

arXiv:2412.14959v242 citationsh-index: 10Has CodeACL
Originality Incremental advance
AI Analysis

This work addresses reliability issues in LLMs' self-correction for users relying on AI-generated responses, though it is incremental as it builds on known limitations.

The paper investigates the failure cases of intrinsic self-correction in LLMs, revealing that it can cause answer wavering and introduce biases on tasks ranging from simple factual questions to complex ones, and proposes strategies like question repeating and fine-tuning for mitigation.

Intrinsic self-correction was proposed to improve LLMs' responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs' intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs' intrinsic self-correction for different tasks, especially for those failure cases. By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs' intrinsic self-correction. We identify intrinsic self-correction can (1) cause LLMs to waver both intermedia and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes