LG · AI · Apr 17

QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

arXiv:2604.15859 · 69.4 · h-index: 3
Predicted impact: top 26% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For researchers and practitioners needing reliable numerical forecasts from LLMs, this benchmark reveals a critical gap in current models' calibration and uncertainty quantification.

The paper introduces QuantSightBench, a benchmark for evaluating LLMs on quantitative forecasting with prediction intervals. None of the 11 evaluated frontier and open-weight models reached the 90% coverage target; the top performers (Gemini 3.1 Pro, Grok 4, GPT-5.4) fell at least 10 percentage points short, and all models were systematically overconfident at extreme magnitudes.

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates of continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers, Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%), all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
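
The two quantities named in the abstract, empirical coverage and interval sharpness, can be illustrated with a minimal sketch. The abstract does not give the benchmark's exact metric definitions, so this assumes the common textbook ones: coverage is the fraction of realized values that fall inside their stated interval, and sharpness is summarized here by mean interval width. The `IntervalForecast` type, the function names, and the sample numbers are all hypothetical and are not taken from QuantSightBench.

```python
from dataclasses import dataclass


@dataclass
class IntervalForecast:
    """One forecast: a stated interval plus the value that was later observed."""
    lower: float
    upper: float
    actual: float


def empirical_coverage(forecasts):
    """Fraction of forecasts whose interval contains the observed value."""
    hits = sum(1 for f in forecasts if f.lower <= f.actual <= f.upper)
    return hits / len(forecasts)


def mean_width(forecasts):
    """Average interval width; narrower means sharper (assumed definition)."""
    return sum(f.upper - f.lower for f in forecasts) / len(forecasts)


def is_nested(inner, outer):
    """Consistency check: a lower-confidence interval (e.g. 50%) should sit
    inside the higher-confidence one (e.g. 90%) for the same question."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Made-up illustrative data, not benchmark results.
forecasts = [
    IntervalForecast(90.0, 110.0, 100.0),      # observed value inside
    IntervalForecast(2.0, 4.0, 5.0),           # observed value outside
    IntervalForecast(1000.0, 2000.0, 1500.0),  # observed value inside
    IntervalForecast(0.1, 0.3, 0.2),           # observed value inside
]

coverage = empirical_coverage(forecasts)
width = mean_width(forecasts)
nested_ok = is_nested(inner=(95.0, 105.0), outer=(90.0, 110.0))

print(f"coverage={coverage:.2f}, mean width={width:.2f}, nested={nested_ok}")
```

With nominal 90% intervals, a well-calibrated forecaster would have coverage close to 0.90 over many questions; coverage well below that, as reported above, indicates intervals that are too narrow. Mean width is dominated by large-magnitude questions, so a scale-free alternative (for example, width relative to the observed value) may be a better fit when quantities span several orders of magnitude.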
