CL · AI · Feb 21, 2025

Forecasting Frontier Language Model Agent Capabilities

arXiv:2502.15850v2 · 3 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of societal preparedness by forecasting LM agent performance, but it is incremental as it builds on existing forecasting approaches without major methodological breakthroughs.

The paper evaluates six methods for forecasting the capabilities of language model agents and validates them by backtesting on a dataset of 38 LMs. It predicts that by early 2026, non-specialized agents will achieve a 54% success rate on SWE-Bench Verified, while state-of-the-art agents will reach 87%.

As Language Models (LMs) increasingly operate as autonomous agents, accurately forecasting their capabilities becomes crucial for societal preparedness. We evaluate six forecasting methods that predict downstream capabilities of LM agents. We use either "one-step" approaches, which predict benchmark scores directly from input metrics such as compute or model release date, or "two-step" approaches, which first predict an intermediate metric, such as the first principal component of cross-benchmark performance (PC-1) or human-evaluated competitive Elo ratings, and then predict the benchmark score from it. We evaluate our forecasting methods by backtesting them on a dataset of 38 LMs from the OpenLLM 2 leaderboard. We then use the validated two-step approach (Release Date → Elo → Benchmark) to predict LM agent performance for frontier models on three benchmarks: SWE-Bench Verified (software development), Cybench (cybersecurity assessment), and RE-Bench (ML research engineering). Our forecast predicts that by the beginning of 2026, non-specialized LM agents with low capability elicitation will reach a success rate of 54% on SWE-Bench Verified, while state-of-the-art LM agents will reach an 87% success rate. Our approach does not account for recent advances in inference-compute scaling and might thus be too conservative.
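To make the two-step idea (Release Date → Elo → Benchmark) concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a linear trend from release date to Elo and a sigmoid link from Elo to benchmark success rate, fitted by least squares on the logit scale. The function names (`fit_two_step`, `forecast`) are hypothetical, and all data points are synthetic placeholders, not values from the paper. The last lines illustrate a simple backtest: fit on the earlier models and check the error on the held-out most recent one.

```python
import numpy as np

def logit(p):
    # Clip to avoid infinities at exactly 0 or 1.
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_two_step(release_year, elo, score):
    """Fit both stages; returns ((a, b), (c, d)).

    Stage 1 (assumed linear):  elo ~ a * release_year + b
    Stage 2 (assumed sigmoid): score ~ sigmoid(c * elo + d)
    """
    a, b = np.polyfit(release_year, elo, deg=1)
    c, d = np.polyfit(elo, logit(score), deg=1)
    return (a, b), (c, d)

def forecast(model, year):
    """Chain the two stages: date -> predicted Elo -> predicted score."""
    (a, b), (c, d) = model
    elo_hat = a * year + b
    return sigmoid(c * elo_hat + d)

# Synthetic, illustrative data only (NOT taken from the paper).
release_year = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
elo = np.array([1100.0, 1160.0, 1210.0, 1270.0, 1320.0])
score = np.array([0.05, 0.12, 0.22, 0.38, 0.52])

# Forecast for a future date using all available models.
model = fit_two_step(release_year, elo, score)
prediction = forecast(model, 2026.0)

# Backtest: hold out the most recent model and predict it from the rest.
held_out_model = fit_two_step(release_year[:-1], elo[:-1], score[:-1])
backtest_error = abs(forecast(held_out_model, release_year[-1]) - score[-1])
```

The sigmoid keeps forecasts inside the valid success-rate range of 0 to 1, which a plain linear extrapolation of benchmark scores would not guarantee.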

