AI, Apr 21

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

arXiv:2604.18576 66.42 citations h-index: 2
AI Analysis

For forecasting practitioners, BLF provides a new state-of-the-art method with rigorous evaluation, though its improvements over existing LLM-based approaches are incremental.

BLF achieves state-of-the-art binary forecasting on ForecastBench, outperforming top methods including Cassi, GPT-5, Grok 4.20, and Foresight-32B on 400 backtesting questions. Ablations show that the structured belief state is nearly as impactful as web search, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains.

We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running K independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all of the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is almost as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust backtesting framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.
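To make ideas (2) and (3) concrete, here is a minimal sketch of logit-space shrinkage and Platt scaling. It is not the paper's implementation: the abstract does not give the formulas, so the shrinkage weight K / (K + tau), the pseudo-count tau, the fixed prior_prob standing in for the data-dependent prior, and all function names are assumptions made for illustration. The hierarchical prior over the Platt parameters is not reproduced; only the application of already-fitted parameters (a, b) is shown.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to keep the result finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))


def shrinkage_aggregate(trial_probs, prior_prob, tau=2.0):
    """Combine K trial probabilities by shrinking their mean logit toward a prior.

    Assumption: weight = K / (K + tau), with tau a hypothetical pseudo-count.
    The paper's data-dependent prior is replaced here by a fixed prior_prob.
    """
    k = len(trial_probs)
    if k == 0:
        return prior_prob
    mean_logit = sum(logit(p) for p in trial_probs) / k
    weight = k / (k + tau)
    return sigmoid(weight * mean_logit + (1.0 - weight) * logit(prior_prob))


def platt_scale(prob, a, b):
    """Apply Platt scaling in logit space: sigmoid(a * logit(prob) + b).

    a and b are assumed to be fitted elsewhere; a=1, b=0 is the identity.
    """
    return sigmoid(a * logit(prob) + b)


if __name__ == "__main__":
    trials = [0.8, 0.8, 0.8, 0.8]
    combined = shrinkage_aggregate(trials, prior_prob=0.5, tau=4.0)
    print(round(combined, 4))
    print(round(platt_scale(combined, a=1.0, b=0.0), 4))
```

With four trials at 0.8, a prior of 0.5, and tau = 4, the weight is 0.5, so the combined forecast sits halfway between the trials and the prior in log-odds. Working in logit space rather than averaging probabilities directly keeps confident trials from being diluted as strongly near 0 and 1.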

