Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan

arXiv:2605.2149183.8

AI Analysis

This work addresses the bottleneck of evaluating AI-generated research ideas for autonomous scientific discovery, providing a scalable verification method.

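The authors investigate whether language models can forecast the empirical success of research ideas before any experiments are run, constructing a dataset of 11,488 idea pairs from PapersWithCode. With supervised fine-tuning, an 8B-parameter model reaches 77.1% accuracy, outperforming GPT-5 (61.1%). Ablations and out-of-distribution tests show robustness to surface-level heuristics and transfer to a cross-domain time-split test set and an independently constructed test set.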

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% accuracy), supervised fine-tuning (SFT) dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% accuracy with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.
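
The comparative forecasting setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The `IdeaPair` fields, the "A"/"B" label format, and all function names are hypothetical and chosen only for illustration. The sketch assumes that each pair carries a ground-truth winner derived from benchmark results, that a binary exact-match signal is what makes the reward "verifiable", and that swapping the order of the two ideas is one simple way to probe for a positional shortcut (the abstract does not say which surface-level heuristics the authors tested).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class IdeaPair:
    """One comparison item (hypothetical schema, not the paper's)."""
    goal: str     # benchmark-specific research goal
    idea_a: str   # first candidate idea
    idea_b: str   # second candidate idea
    winner: str   # "A" or "B": which idea scored better on the benchmark


def verifiable_reward(prediction: str, pair: IdeaPair) -> float:
    """Binary reward: 1.0 if the predicted label matches the recorded winner."""
    return 1.0 if prediction.strip().upper() == pair.winner else 0.0


def pairwise_accuracy(
    predict: Callable[[IdeaPair], str], pairs: Sequence[IdeaPair]
) -> float:
    """Fraction of pairs for which the predictor names the better idea."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(verifiable_reward(predict(p), p) for p in pairs) / len(pairs)


def swap(pair: IdeaPair) -> IdeaPair:
    """Present the same two ideas in the opposite order, flipping the label."""
    flipped = "B" if pair.winner == "A" else "A"
    return IdeaPair(pair.goal, pair.idea_b, pair.idea_a, flipped)


def always_a(pair: IdeaPair) -> str:
    """A degenerate predictor that ignores content and always picks slot A."""
    return "A"


# Toy data for illustration only; these are not items from the dataset.
toy_pairs = [
    IdeaPair("goal 1", "idea x", "idea y", "A"),
    IdeaPair("goal 1", "idea x", "idea z", "B"),
    IdeaPair("goal 2", "idea p", "idea q", "A"),
    IdeaPair("goal 2", "idea r", "idea s", "A"),
]
swapped_pairs = [swap(p) for p in toy_pairs]

acc_original = pairwise_accuracy(always_a, toy_pairs)
acc_swapped = pairwise_accuracy(always_a, swapped_pairs)
```

A predictor that relies only on position scores well on one ordering and poorly on the other, so comparing `acc_original` with `acc_swapped` exposes that shortcut. A content-based predictor should score about the same on both orderings.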
