AIMay 26

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability:Separating Calibration from Ranking

arXiv:2605.2771242.0
AI Analysis

For researchers and practitioners using LLMs for complex reasoning, this work provides a principled online inference framework for reliability estimation, though the gains are incremental over strong baselines.

The paper introduces Sequential Bayesian Belief Tracking (SBBT) for estimating the probability of eventual success given partial reasoning traces. SBBT improves Brier scores with scalar scores and achieves up to +0.110 AUROC gain in hard math settings using structure-aware evidence, while revealing that scalar scores primarily aid calibration and structure-aware signals improve ranking only when baseline methods have not already captured the rank evidence.

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes