Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs
This work addresses a critical limitation in error detection for LLMs, which matters for ensuring truthfulness in AI applications. The contribution is incremental, however, because it builds on existing detection methods.
The paper studies self-consistent errors in large language models, where a model repeatedly generates the same incorrect response. It finds that current detection methods struggle with these errors and proposes a cross-model probe method that improves performance across three LLM families.
As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent errors, where LLMs repeatedly generate the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods struggle significantly to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improvement. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
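To make the distinction between self-consistent and inconsistent errors concrete, here is a minimal Python sketch. It is not the paper's formal definition: the function name `classify_samples`, the exact-match normalization, and the `agreement_threshold` parameter are all assumptions introduced for illustration. The sketch labels a question by checking whether the most frequent sampled answer is wrong and how strongly the samples agree on it.

```python
from collections import Counter


def classify_samples(samples, reference, agreement_threshold=1.0):
    """Label a set of stochastic samples for one question.

    Hypothetical helper: the normalization and threshold are illustrative
    assumptions, not the criteria used in the paper.
    """
    if not samples:
        raise ValueError("samples must be non-empty")

    # Naive normalization (assumption): case-insensitive exact match.
    normalized = [s.strip().lower() for s in samples]
    ref = reference.strip().lower()

    # Most frequent answer and the fraction of samples that agree with it.
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    agreement = top_count / len(normalized)

    if top_answer == ref:
        return "correct"
    if agreement >= agreement_threshold:
        # The same wrong answer recurs across the samples.
        return "self_consistent_error"
    # The wrong answers vary from sample to sample.
    return "inconsistent_error"


print(classify_samples(["Lyon", "lyon", "Lyon"], "Paris"))
print(classify_samples(["Lyon", "Nice", "Rome"], "Paris"))
```

With the default threshold of 1.0, only unanimous wrong answers count as self-consistent. This also illustrates why detectors that rely on disagreement between samples have little signal on such cases: the samples give no hint that anything is wrong.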