CL · LG · Jan 19

Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models

arXiv:2601.13368v1 · 3 citations · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses uncertainty assessment for large language models, with the aim of keeping misleading hallucinations from reaching users. It represents an incremental improvement over existing methods.

The paper tackles the problem of uncertainty quantification in large language models with reasoning modules, where existing methods overlook temporal confidence spread, leading to inflated confidence. The proposed method incorporates inter-step attention and hidden confidence mechanisms, outperforming state-of-the-art methods on GAOKAO math and CLadder causal reasoning benchmarks with strong performance on Negative Log-Likelihood and Expected Calibration Error.
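For readers unfamiliar with the two metrics named above, here is a minimal sketch of how they are commonly computed from per-answer confidence scores and 0/1 correctness labels. The function names, the equal-width binning, and the choice of 10 bins are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
import math


def negative_log_likelihood(confidences, correct, eps=1e-12):
    """Mean binary NLL of confidence scores against 0/1 correctness labels."""
    total = 0.0
    for p, y in zip(confidences, correct):
        # Clamp to avoid log(0) for fully confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(confidences)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: size-weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Lower is better for both: NLL penalizes confident wrong answers heavily, while ECE measures how far stated confidence drifts from observed accuracy.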

As reasoning modules, such as the chain-of-thought mechanism, are applied to large language models, they achieve strong performance on various tasks such as answering common-sense questions and solving math problems. The main challenge now is to assess the uncertainty of answers, which can help prevent misleading or serious hallucinations for users. Although current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences, the temporal spread of confidence is often overlooked. This oversight can lead to inflated overall confidence, even when earlier steps exhibit very low confidence. To address this issue, we propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps. For handling long-horizon responses, we introduce a hidden confidence mechanism to retain historical confidence information, which is then combined with stepwise confidence to produce a more accurate overall estimate. We evaluate our method on the GAOKAO math benchmark and the CLadder causal reasoning dataset using mainstream open-source large language models. Our approach is shown to outperform state-of-the-art methods by achieving a superior balance between predictive quality and calibration, demonstrated by strong performance on both Negative Log-Likelihood and Expected Calibration Error.
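The inflation problem described in the abstract can be illustrated with a toy example. This is not the paper's mechanism, which relies on inter-step attention and a hidden confidence state; the update rule, the `retain` parameter, and both function names below are hypothetical and chosen only to show why carrying confidence forward across steps behaves differently from averaging.

```python
def mean_confidence(step_confidences):
    """Baseline: average the per-step confidences, ignoring their order."""
    return sum(step_confidences) / len(step_confidences)


def recurrent_confidence(step_confidences, retain=0.9):
    """Toy recurrence (an assumption, not the paper's formula).

    A hidden confidence h is carried across steps:
        h_t = c_t * (retain * h_{t-1} + (1 - retain)),  with h_0 = 1.
    A low-confidence early step therefore keeps dragging down later steps.
    """
    hidden = 1.0
    for c in step_confidences:
        hidden = c * (retain * hidden + (1.0 - retain))
    return hidden


if __name__ == "__main__":
    # One shaky early step followed by three confident steps.
    steps = [0.2, 0.95, 0.95, 0.95]
    print("mean:     ", mean_confidence(steps))
    print("recurrent:", recurrent_confidence(steps))
```

In this sketch the plain average stays high despite the weak first step, while the recurrent estimate remains well below it, which mirrors the abstract's motivation for retaining historical confidence.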
