AIApr 12

Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

Wanyi Chen, Xiao Yang, Xu Yang, Tianming Sha, Qizheng Li, Zhuo Wang, Bowen Xian, Fang Kong, Weiqing Liu, Jiang Bian

arXiv:2604.1054791.01 citationsh-index: 17Has Code

Predicted impact top 18% in AI · last 90 daysOriginality Incremental advance

AI Analysis

For researchers and practitioners in LLM alignment and RL, this benchmark reveals that current LLM agents struggle with autonomous RL engineering beyond simple tasks, highlighting a critical gap in agent capabilities.

The paper introduces Agent^2 RL-Bench, a benchmark to evaluate whether LLM agents can autonomously design and run RL post-training pipelines. Results show agents achieve dramatic gains on ALFWorld (5.97 to 93.28) but marginal progress on others (DeepSearchQA: +2.75), with driver choice significantly impacting interactive tasks.

We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training -- whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels -- from static rule-based training to closed-loop online RL with trajectory collection -- each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains -- on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts -- yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks -- within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench.

View on arXiv PDF Code

Similar