LG, AI | Jun 9, 2025

Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic

Zhenjiang Mao, Artem Bisliouk, Rohith Reddy Nama, Ivan Ruchkin

arXiv:2506.08243v1 | 17.99 citations | h-index: 5 | BEA

Originality: Incremental advance

AI Analysis

The work addresses the risk of incorrect outputs in domains like education, where users may lack the expertise to assess reasoning. It is an incremental advance, however, because it builds on existing Chain-of-Thought prompting and confidence estimation techniques.

The paper tackles the problem of large language models producing highly confident but incorrect outputs in mathematical reasoning tasks. It proposes a framework that models stepwise confidence as a temporal signal and evaluates it with Signal Temporal Logic (STL). The results show consistent improvements in calibration metrics and more reliable uncertainty estimates than conventional methods.

Large Language Models (LLMs) have shown impressive performance in mathematical reasoning tasks when guided by Chain-of-Thought (CoT) prompting. However, they tend to produce highly confident yet incorrect outputs, which poses significant risks in domains like education, where users may lack the expertise to assess reasoning steps. To address this, we propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we define formal STL-based constraints to capture desirable temporal properties and compute robustness scores that serve as structured, interpretable confidence estimates. Our approach also introduces a set of uncertainty reshaping strategies to enforce smoothness, monotonicity, and causal consistency across the reasoning trajectory. Experiments show that our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
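
To make the idea of an STL robustness score over stepwise confidence concrete, here is a minimal Python sketch using the standard quantitative semantics of STL: "always" is a minimum over time, "eventually" is a maximum, and conjunction is a minimum. The specific constraints, the thresholds (0.5 and 0.3), and the function names are illustrative assumptions, not the constraints defined in the paper.

```python
from typing import Sequence


def always_at_least(signal: Sequence[float], threshold: float) -> float:
    """Robustness of G (c_t >= threshold): the worst-case margin over all steps."""
    return min(c - threshold for c in signal)


def eventually_at_least(signal: Sequence[float], threshold: float) -> float:
    """Robustness of F (c_t >= threshold): the best-case margin over all steps."""
    return max(c - threshold for c in signal)


def always_smooth(signal: Sequence[float], max_jump: float) -> float:
    """Robustness of G (|c_{t+1} - c_t| <= max_jump).

    A signal with fewer than two steps satisfies the constraint vacuously,
    so the robustness is +infinity in that case.
    """
    return min(
        (max_jump - abs(b - a) for a, b in zip(signal, signal[1:])),
        default=float("inf"),
    )


def conjunction(*robustness: float) -> float:
    """Robustness of a conjunction is the minimum of its parts."""
    return min(robustness)


# Hypothetical per-step confidences for a five-step reasoning chain.
steps = [0.9, 0.85, 0.8, 0.4, 0.75]

floor_score = always_at_least(steps, threshold=0.5)   # negative: step 4 dips below 0.5
smooth_score = always_smooth(steps, max_jump=0.3)     # negative: the 0.8 -> 0.4 drop exceeds 0.3
overall = conjunction(floor_score, smooth_score)
```

A positive score means the constraint holds with that much margin, and a negative score means it is violated by that much. The sign and magnitude together are what make such a score usable as an interpretable confidence estimate.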
