CL · AI · Feb 7, 2025

LLMs Can Teach Themselves to Better Predict the Future

arXiv:2502.05253v1 · 6 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving LLM forecasting accuracy for applications that require predictions about future events. The advance is incremental because it builds on existing fine-tuning methods.

The paper tackles the problem of enhancing large language models' forecasting capabilities by introducing an outcome-driven fine-tuning framework that uses model self-play to generate and rank reasoning trajectories based on actual outcomes, then fine-tunes via Direct Preference Optimization. The result is a 7-10% increase in prediction accuracy for models like Phi-4 14B and DeepSeek-R1 14B, matching the performance of larger frontier models like GPT-4o.

We present an outcome-driven fine-tuning framework that enhances the forecasting capabilities of large language models (LLMs) without relying on human-curated reasoning samples. Our method leverages model self-play to generate pairs of diverse reasoning trajectories and probabilistic forecasts for a set of diverse questions that resolve after the models' knowledge cutoff date. We then rank pairs of these reasoning traces by their distance to the actual outcomes before fine-tuning the model via Direct Preference Optimization (DPO). On a separate test set, our approach increases prediction accuracy of Phi-4 14B and DeepSeek-R1 14B by 7-10% over a base model and a DPO fine-tuned control model with randomized labels, bringing them on par with forecasting capabilities of much larger frontier models like GPT-4o.
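The ranking step described above can be illustrated with a minimal sketch. It is not the authors' code: the names `Forecast`, `build_preference_pair`, and `randomize_labels` are hypothetical, the distance is assumed to be the absolute difference between the forecast probability and a binary outcome (the abstract says only "distance to the actual outcomes"), and the `prompt`/`chosen`/`rejected` record layout is an assumed convention for preference data.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Forecast:
    """One self-play sample (hypothetical structure)."""
    reasoning: str      # the generated reasoning trace
    probability: float  # forecast probability that the question resolves "yes"


def build_preference_pair(question: str, a: Forecast, b: Forecast,
                          outcome: int, min_gap: float = 0.0) -> Optional[dict]:
    """Rank two forecasts by distance to the resolved outcome (0 or 1).

    Assumption: distance is |probability - outcome|. Pairs whose distances
    differ by no more than `min_gap` carry no preference signal and are
    skipped by returning None.
    """
    err_a = abs(a.probability - outcome)
    err_b = abs(b.probability - outcome)
    if abs(err_a - err_b) <= min_gap:
        return None
    chosen, rejected = (a, b) if err_a < err_b else (b, a)
    return {
        "prompt": question,
        "chosen": chosen.reasoning,
        "rejected": rejected.reasoning,
    }


def randomize_labels(pair: dict, rng: random.Random) -> dict:
    """Control condition: swap chosen/rejected with probability 0.5,
    so the preference label no longer depends on the outcome."""
    if rng.random() < 0.5:
        return {**pair, "chosen": pair["rejected"], "rejected": pair["chosen"]}
    return pair
```

For a question that resolved "yes" (outcome 1), a trace ending in a 0.8 forecast would be preferred over one ending in 0.3. The resulting records could then be passed to a DPO training routine, and the randomized variant corresponds to the control model mentioned in the abstract.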

Code Implementations: 1 repo
