TimeSeek: Temporal Reliability of Agentic Forecasters
This paper addresses the problem of evaluating the temporal reliability of LLM forecasters in prediction markets. Its contribution is incremental: it reports descriptive results without introducing new methods.
The paper introduces TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over time in prediction markets. It finds that models are most competitive early in a market's life and on high-uncertainty markets, and that web search improves pooled Brier Skill Score overall but hurts in 12% of model-checkpoint pairs.
We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, for 15,000 forecasts total. Models are most competitive early in a market's life and on high-uncertainty markets, but much less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval is helpful on average but not uniformly so. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than a single market snapshot or a uniform tool-use setting.
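The metric and the two-model ensemble mentioned above can be illustrated with a minimal sketch. It assumes that BSS takes the standard form 1 - BS_model / BS_reference, that the reference forecast is the market price at the same checkpoint, and that an ensemble is an unweighted average of two models' probabilities. The abstract does not spell out any of these details, so the function names and the toy numbers below are hypothetical and do not come from the benchmark.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    model_probs: Sequence[float],
    reference_probs: Sequence[float],
    outcomes: Sequence[int],
) -> float:
    """BSS = 1 - BS_model / BS_reference.

    Positive values mean the model beats the reference; negative values
    mean it is worse. Using the market price as the reference is an
    assumption of this sketch.
    """
    reference_bs = brier_score(reference_probs, outcomes)
    if reference_bs == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - brier_score(model_probs, outcomes) / reference_bs


def average_ensemble(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Unweighted two-model ensemble (assumed form): mean of the two forecasts."""
    if len(a) != len(b):
        raise ValueError("forecast lists must be equal in length")
    return [(x + y) / 2 for x, y in zip(a, b)]


# Toy data, invented for illustration only (not TimeSeek results).
outcomes = [1, 0, 1, 0]
market = [0.5, 0.5, 0.5, 0.5]
model_a = [0.75, 0.25, 0.75, 0.25]
model_b = [1.0, 0.5, 0.5, 0.0]

bss_a = brier_skill_score(model_a, market, outcomes)
bss_b = brier_skill_score(model_b, market, outcomes)
bss_ensemble = brier_skill_score(average_ensemble(model_a, model_b), market, outcomes)

print(f"BSS model A:  {bss_a:.4f}")
print(f"BSS model B:  {bss_b:.4f}")
print(f"BSS ensemble: {bss_ensemble:.4f}")
```

A pooled BSS in this sketch would be obtained by passing all forecasts for a model (across markets and checkpoints) to `brier_skill_score` in a single call, rather than averaging per-market scores; whether the paper pools in exactly this way is not stated in the abstract.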