AI · May 21

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

arXiv:2605.22672 · 72.0
AI Analysis

For practitioners using LLMs for forecasting in high-stakes domains like finance and epidemiology, this work reveals that model capability can degrade forecast quality in critical tail regions, challenging the assumption that larger models are universally better.

The paper documents inverse scaling in LLMs on forecasting problems with superlinear growth and tail risk, where more capable models produce worse distributional forecasts, with failures concentrated at the upper tail. This effect is observed across synthetic and real-world datasets and reverses the capability-accuracy relationship when using tail-inclusive scoring instead of single-threshold metrics.

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release; in forecasts of synthetic SIR epidemics with a matched linear control; and in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows that the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that model scale and post-training each contribute independently to this effect. Domain knowledge does not reliably rescue calibration. The inverse scaling does not appear on the single-threshold metrics common in LLM forecasting benchmarks: single-threshold scoring at conventional cutoffs misses the upper-tail cost, while tail-inclusive scoring reverses the sign of the capability-accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
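The sign reversal described in the abstract can be illustrated with a toy sketch. Everything here is an assumption for illustration: the abstract does not name the paper's scoring rules, so pinball (quantile) loss stands in for a continuous, unbounded measure and a Brier score at one cutoff stands in for a single-threshold metric. The two forecasts, the outcome, and the threshold are invented numbers, not results from the paper, and the helper names are hypothetical.

```python
# Toy illustration only: invented numbers, not data from the paper.

def pinball_loss(q_pred, y, tau):
    """Pinball (quantile) loss of a predicted tau-quantile against outcome y."""
    diff = y - q_pred
    return tau * diff if diff >= 0 else (tau - 1) * diff

def cdf_from_quantiles(quantiles, x):
    """Approximate P(Y <= x) by linear interpolation between forecast quantiles.

    Outside the forecast range the CDF is clamped to the lowest or highest
    quantile level, which is a crude simplification.
    """
    pairs = sorted(quantiles.items())  # [(tau, value), ...] in increasing tau
    if x <= pairs[0][1]:
        return pairs[0][0]
    if x >= pairs[-1][1]:
        return pairs[-1][0]
    for (t0, v0), (t1, v1) in zip(pairs, pairs[1:]):
        if v0 <= x <= v1:
            if v1 == v0:
                return t1
            return t0 + (t1 - t0) * (x - v0) / (v1 - v0)
    return pairs[-1][0]

def brier_at_threshold(quantiles, y, threshold):
    """Brier score for the binary event 'y exceeds threshold'."""
    p_event = 1.0 - cdf_from_quantiles(quantiles, threshold)
    outcome = 1.0 if y > threshold else 0.0
    return (p_event - outcome) ** 2

def per_quantile_pinball(quantiles, y):
    """Per-quantile decomposition: one pinball loss per forecast quantile."""
    return {tau: pinball_loss(q, y, tau) for tau, q in sorted(quantiles.items())}

# Hypothetical forecasts as {quantile level: predicted value}.
forecast_a = {0.1: 60.0, 0.5: 90.0, 0.9: 150.0}   # modest upper tail
forecast_b = {0.1: 60.0, 0.5: 95.0, 0.9: 1000.0}  # upper tail shifted far up

y_true = 100.0     # realized outcome (invented)
threshold = 80.0   # single cutoff used by the binary metric (invented)

decomp_a = per_quantile_pinball(forecast_a, y_true)
decomp_b = per_quantile_pinball(forecast_b, y_true)
mean_pinball_a = sum(decomp_a.values()) / len(decomp_a)
mean_pinball_b = sum(decomp_b.values()) / len(decomp_b)
brier_a = brier_at_threshold(forecast_a, y_true, threshold)
brier_b = brier_at_threshold(forecast_b, y_true, threshold)

print("Brier at threshold:", brier_a, brier_b)
print("Mean pinball loss: ", mean_pinball_a, mean_pinball_b)
print("Per-quantile (B):  ", decomp_b)
```

In this toy case forecast B scores better than A on the threshold metric but worse on mean pinball loss, and the per-quantile breakdown shows that B's extra loss comes almost entirely from the 0.9 quantile. The bounded binary score cannot register how far the upper tail overshoots, which is the kind of cost the abstract says single-threshold scoring misses.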

