Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
This work addresses the problem of scaling RL in practice for autonomous agents in complex environments, offering incremental insights for researchers and practitioners in AI and robotics.
This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents through a systematic empirical study on the TravelPlanner testbed, distilling a recipe whose RL-trained models achieve state-of-the-art performance on TravelPlanner and outperform leading LLMs.
Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent, as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
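To make the data-composition takeaway concrete, the following is a minimal sketch of how a balanced difficulty mixture of roughly 1K training samples might be drawn. It is not the authors' pipeline: the function name `sample_balanced_mixture`, the `difficulty` field, and the equal-share-per-level rule are all assumptions made for illustration, since the abstract does not specify how the mixture is constructed.

```python
import random
from collections import defaultdict


def sample_balanced_mixture(tasks, total=1000, seed=0):
    """Draw about `total` tasks with a near-equal share per difficulty level.

    Hypothetical helper: each task is assumed to be a dict with a
    "difficulty" key. Equal shares are an illustrative reading of
    "balanced", not a rule taken from the paper.
    """
    rng = random.Random(seed)

    # Group tasks by their difficulty label.
    by_level = defaultdict(list)
    for task in tasks:
        by_level[task["difficulty"]].append(task)

    levels = sorted(by_level)
    if not levels:
        return []

    # Split the budget evenly; the first `remainder` levels get one extra.
    per_level, remainder = divmod(total, len(levels))

    mixture = []
    for i, level in enumerate(levels):
        quota = per_level + (1 if i < remainder else 0)
        pool = by_level[level]
        # Sample without replacement, capped at the pool size.
        mixture.extend(rng.sample(pool, min(quota, len(pool))))

    # Shuffle so training batches are not ordered by difficulty.
    rng.shuffle(mixture)
    return mixture
```

If one difficulty level has fewer tasks than its quota, this sketch simply returns fewer than `total` samples rather than oversampling; whether to oversample or rebalance in that case is a design choice the abstract leaves open.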