LG · AI · CL · Sep 30, 2024

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

Berkeley · DeepMind
arXiv:2409.19839v5 · 66 citations · h-index: 91
Originality: Incremental advance
AI Analysis

This paper addresses the problem of evaluating AI forecasting capabilities for researchers and practitioners. The contribution is incremental because it builds on existing benchmarking practices.

The authors address the lack of a standardized framework for evaluating the forecasting accuracy of machine learning systems by introducing ForecastBench, a dynamic benchmark of 1,000 automatically generated questions about future events. On a random subset of 200 questions, expert human forecasters outperformed the top-performing LLM with statistical significance (p-value < 0.001).

Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench is comprised solely of questions about future events that have no known answer at the time of submission. We quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark ($N=200$). While LLMs have achieved super-human performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM ($p$-value $<0.001$). We display system and human scores in a public leaderboard at www.forecastbench.org.
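The evaluation described in the abstract (draw a random subset of questions, score each forecaster once outcomes are known, then test whether the gap between two forecasters is significant) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not name its accuracy metric or its significance test, so the Brier score and the sign-flip permutation test used here are assumptions, and every function name is hypothetical.

```python
import random
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.

    Lower is better. ASSUMPTION: the abstract does not name its metric;
    the Brier score is used here only as a common choice for binary questions.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return mean((f - o) ** 2 for f, o in zip(forecasts, outcomes))


def sample_questions(question_ids, n, seed=0):
    """Draw a random subset of n question ids without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(question_ids), n)


def paired_permutation_p_value(errors_a, errors_b, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-question error differences.

    ASSUMPTION: a generic paired test chosen for illustration, not
    necessarily the test behind the reported p-value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Synthetic data only: these numbers are made up and are not benchmark results.
    subset = sample_questions(range(1000), 200, seed=1)
    rng = random.Random(2)
    outcomes = [rng.randint(0, 1) for _ in subset]
    forecaster_a = [0.8 if o else 0.2 for o in outcomes]  # fairly well calibrated
    forecaster_b = [0.5 for _ in outcomes]                # always uncertain
    errors_a = [(f - o) ** 2 for f, o in zip(forecaster_a, outcomes)]
    errors_b = [(f - o) ** 2 for f, o in zip(forecaster_b, outcomes)]
    print("Brier A:", brier_score(forecaster_a, outcomes))
    print("Brier B:", brier_score(forecaster_b, outcomes))
    print("p-value:", paired_permutation_p_value(errors_a, errors_b, 2000))
```

Pairing the errors by question matters here because both forecasters answer the same questions, so question difficulty cancels out in the per-question differences.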

Code Implementations: 1 repo
