AI · May 7

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

Marc Boubnovski Martell, Josefa Lia Stoisser, Kaspar Märtens, Jialin Yu, Robert Kitchen, Philip Torr, Jesper Ferkinghoff-Borg

arXiv:2605.06308

AI Analysis

For practitioners deploying CoT reasoning through text-only APIs, the paper provides a confidence-estimation method that is more sample-efficient than the standard self-consistency baseline and outperforms it.

The paper proposes a black-box trajectory-confidence score for chain-of-thought reasoning that uses sliding-window embeddings and a one-parameter softmax to measure convergence to answer anchors, requiring no logits or hidden states. Across six benchmark-reasoner settings, fusing this geometry score with coverage and verbalized-confidence channels at K=4 achieves Pareto improvements over self-consistency at K=8, with median AUC 0.78 vs 0.71 (ΔAUC = +0.075).
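To make the summary above concrete, here is a minimal sketch of a sliding-window trajectory scored against answer anchors with a single-temperature softmax. It is an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real sentence embedder, cosine similarity is assumed as the similarity measure, and the window size, stride, and `beta` values are arbitrary. The function returns one probability per window and leaves open how the paper reduces that trajectory to a single score.

```python
import math
import zlib


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a sentence embedder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def sliding_windows(tokens, size, stride):
    """Split a token list into overlapping windows (a short trace yields one window)."""
    if len(tokens) <= size:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, stride)]


def anchor_softmax(window_vec, anchor_vecs, beta):
    """Softmax over window-to-anchor similarities; beta >= 0 is the single free parameter."""
    sims = [sum(w * a for w, a in zip(window_vec, anchor)) for anchor in anchor_vecs]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(beta * (s - top)) for s in sims]
    total = sum(weights)
    return [w / total for w in weights]


def trajectory_confidence(cot, anchors, chosen, beta=10.0, size=8, stride=4,
                          embed=toy_embed):
    """Per-window probability mass on the chosen answer anchor (assumed aggregation-free)."""
    anchor_vecs = [embed(a) for a in anchors]
    windows = sliding_windows(cot.split(), size, stride)
    return [anchor_softmax(embed(w), anchor_vecs, beta)[chosen] for w in windows]
```

In practice `embed` would be swapped for a real text-embedding model, and `anchors` would be the candidate answer strings for the question. A trace that converges on its final answer should show the chosen anchor's probability rising across windows.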

Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, fusing this score with coverage and verbalized-confidence channels at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71, ΔAUC = +0.075). A fixed-pick control (+0.060) and E5 cross-embedder replication rule out answer switching and single-vendor artifacts. Geometry peaks in the penultimate window across benchmarks and reasoners, and inverts at the terminal window on GPQA Diamond. Three unscaffolded regimes separate black-box confidence into a judge-mediated Coverage prior (C), within-trace Geometry (G), and a conditional Verbalization channel (V). Across 18 benchmark × reasoner × proposer settings, C and G provide independent signal in 18/18 and 16/18, while V contributes residual signal in 6/18. Swapping the judge from GPT-5-mini to Claude Sonnet 4.6 leaves G-only AUC unchanged (|Δ| ≤ 0.013) and shifts C-only AUC by at most ±0.02 (κ = 0.82). Fusion beats the best single channel in 17/18 settings (median AUC 0.78, max 0.92).
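For reference, the self-consistency baseline and the AUC metric mentioned in the abstract can be sketched as follows. This is a generic illustration, not code from the paper: it assumes the baseline confidence is the vote share of the majority answer among K samples, and that AUC means the area under the ROC curve for separating correct from incorrect predictions by confidence. Both function names are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Majority answer among K sampled answers and its vote share as confidence."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def auroc(confidences, correct):
    """Probability that a correct item outranks an incorrect one (ties count half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `self_consistency(["B", "B", "A", "B"])` gives answer `"B"` with confidence 0.75. Because each question needs K model calls, the cost of this baseline grows linearly with K, which is the expense the abstract refers to.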
