Evaluating LLMs on Real-World Forecasting Against Expert Forecasters
This work addresses the problem of assessing LLMs' forecasting abilities, which is relevant to AI researchers and practitioners, but the contribution is incremental because it builds on existing benchmarks.
The study evaluated state-of-the-art LLMs on 464 forecasting questions from Metaculus, finding that frontier models achieved Brier scores that surpassed the human crowd but still significantly underperformed a group of experts.
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, LLMs struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against top forecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of experts.
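For readers unfamiliar with the metric, the following is a minimal sketch of how a Brier score can be computed for binary questions. It assumes the common binary form (the mean squared difference between the forecast probability and the 0/1 outcome, where lower is better); the text above does not state which variant was used, and the function name `brier_score` and the sample numbers are hypothetical, not data from the study.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    Assumes binary questions: each outcome is 1 (resolved yes) or 0 (resolved no).
    Lower is better; 0.0 is a perfect score.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if len(probabilities) == 0:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical forecasts on three questions (illustrative values only).
forecasts = [0.9, 0.2, 0.6]
resolutions = [1, 0, 0]
print(brier_score(forecasts, resolutions))

# An uninformed forecaster who always answers 0.5 scores 0.25.
print(brier_score([0.5, 0.5], [1, 0]))
```

Under this definition, always answering 0.5 yields 0.25, which is a useful reference point when reading Brier scores for binary questions.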