AIRA_2: Overcoming Bottlenecks in AI Research Agents
This work addresses performance limitations in AI research agents and is aimed at researchers who build them. The contribution is incremental: it builds on prior work and targets specific, previously identified bottlenecks.
The paper tackles three bottlenecks in AI research agents: synchronous single-GPU execution, a generalization gap under validation-based selection, and the limited capability of fixed, single-turn LLM operators. It introduces AIRA$_2$, which combines an asynchronous multi-GPU worker pool, a Hidden Consistent Evaluation protocol, and ReAct agents, and reports a mean Percentile Rank of 71.8% at 24 hours and 76.0% at 72 hours on MLE-bench-30.
Existing research has identified three structural performance bottlenecks in AI research agents: (1) synchronous single-GPU execution constrains sample throughput, limiting the benefit of search; (2) a generalization gap where validation-based selection causes performance to degrade over extended search horizons; and (3) the limited capability of fixed, single-turn LLM operators imposes a ceiling on search performance. We introduce AIRA$_2$, which addresses these bottlenecks through three architectural choices: an asynchronous multi-GPU worker pool that increases experiment throughput linearly; a Hidden Consistent Evaluation protocol that delivers a reliable evaluation signal; and ReAct agents that dynamically scope their actions and debug interactively. On MLE-bench-30, AIRA$_2$ achieves a mean Percentile Rank of 71.8% at 24 hours, surpassing the previous best of 69.9%, and steadily improves to 76.0% at 72 hours. Ablation studies reveal that each component is necessary and that the "overfitting" reported in prior work was driven by evaluation noise rather than true data memorization.
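
The claim that apparent "overfitting" can come from evaluation noise rather than memorization can be illustrated with a small simulation. This is a minimal sketch and not the paper's ablation. It assumes every candidate has the same true quality and that validation and test scores carry independent Gaussian noise. The function name `selection_gap` and the values of `noise_std` and `trials` are hypothetical choices made for illustration. Under these assumptions, picking the candidate with the best validation score yields a validation-minus-test gap that grows with the number of candidates searched, even though nothing has been memorized.

```python
import random
import statistics


def selection_gap(n_candidates, noise_std, trials=500, seed=0):
    """Mean (validation - test) score of the candidate picked by validation score.

    Assumption: every candidate has the same true quality (0.0), so any gap
    comes purely from evaluation noise, not from memorizing validation data.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # Independent noisy validation and test estimates for each candidate.
        val = [rng.gauss(0.0, noise_std) for _ in range(n_candidates)]
        test = [rng.gauss(0.0, noise_std) for _ in range(n_candidates)]
        # Validation-based selection: keep the candidate that looks best.
        best = max(range(n_candidates), key=val.__getitem__)
        gaps.append(val[best] - test[best])
    return statistics.mean(gaps)


if __name__ == "__main__":
    # More candidates searched -> larger apparent validation/test gap.
    for n in (1, 10, 100):
        print(n, round(selection_gap(n, noise_std=0.02), 4))
```

With a single candidate the gap averages out near zero. As the search horizon grows, the selected candidate's validation score is increasingly an upward noise fluctuation, which matches the pattern described as degradation over extended search horizons.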