LLMs Can Teach Themselves to Better Predict the Future
This paper addresses the problem of improving LLM forecasting accuracy for applications that require predictions about future events. The contribution is incremental, since it builds on existing fine-tuning methods.
The paper introduces an outcome-driven fine-tuning framework to improve the forecasting capabilities of large language models. The model generates pairs of reasoning trajectories through self-play, the pairs are ranked by how close their forecasts come to the actual outcomes, and the model is then fine-tuned via Direct Preference Optimization (DPO). On a separate test set, prediction accuracy for Phi-4 14B and DeepSeek-R1 14B rises by 7-10% over both the base model and a DPO control model trained with randomized labels, bringing them on par with much larger frontier models like GPT-4o.
We present an outcome-driven fine-tuning framework that enhances the forecasting capabilities of large language models (LLMs) without relying on human-curated reasoning samples. Our method leverages model self-play to generate pairs of diverse reasoning trajectories and probabilistic forecasts for a set of diverse questions that resolve after the models' knowledge cutoff date. We then rank pairs of these reasoning traces by their distance to the actual outcomes before fine-tuning the model via Direct Preference Optimization (DPO). On a separate test set, our approach increases prediction accuracy of Phi-4 14B and DeepSeek-R1 14B by 7-10% over a base model and a DPO fine-tuned control model with randomized labels, bringing them on par with forecasting capabilities of much larger frontier models like GPT-4o.
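
The ranking step described above can be illustrated with a minimal sketch. It assumes binary questions (outcome 0 or 1) and uses absolute distance between the forecast probability and the outcome. The abstract does not specify the exact distance measure or how ties are handled, so both are assumptions here, and the names `Forecast`, `PreferencePair`, `rank_pair`, and `min_gap` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Forecast:
    reasoning: str      # reasoning trace produced by the model during self-play
    probability: float  # forecast probability that the question resolves "yes"


@dataclass
class PreferencePair:
    prompt: str    # the forecasting question
    chosen: str    # trace whose forecast was closer to the outcome
    rejected: str  # trace whose forecast was further from the outcome


def rank_pair(question: str, a: Forecast, b: Forecast, outcome: int,
              min_gap: float = 0.0) -> Optional[PreferencePair]:
    """Build a preference pair from two self-play forecasts.

    Assumption: distance is |probability - outcome| with outcome in {0, 1}.
    Assumption: pairs whose errors differ by at most `min_gap` are dropped,
    because they carry no usable preference signal.
    """
    err_a = abs(a.probability - outcome)
    err_b = abs(b.probability - outcome)
    if abs(err_a - err_b) <= min_gap:
        return None
    chosen, rejected = (a, b) if err_a < err_b else (b, a)
    return PreferencePair(question, chosen.reasoning, rejected.reasoning)
```

The resulting prompt/chosen/rejected triples have the usual shape of DPO training data. Under this sketch, the randomized-label control mentioned in the abstract would correspond to swapping `chosen` and `rejected` at random rather than by outcome.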