Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning
This work addresses inefficient and ineffective planning by AI agents in interactive, sparse-reward environments. It offers a framework that could enhance agentic capabilities, though it appears incremental because it builds on existing reasoning models and planning methods.
The paper tackles the challenges of applying large language reasoning models to multi-round agentic planning in sparse-reward environments by proposing BPO, a three-stage framework that establishes a self-improving data flywheel, achieving state-of-the-art results with significant token efficiency on benchmarks like ALFWorld, ScienceWorld, and WebShop.
Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively from experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art performance with significant token efficiency, providing a new recipe for reasoning models in agentic planning.
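The refinement stage can be illustrated with a minimal sketch of reward-gated rejection sampling. This is not the authors' implementation: the `Trajectory` type, the `reward_gated_filter` function, the threshold of 1.0, and the choice to keep the shortest successful rollout per task are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    """One rollout of an agent on a task (hypothetical structure)."""
    task_id: str
    steps: Tuple[Tuple[str, str], ...]  # (thought, action) pairs
    reward: float                       # sparse reward, observed only at the end


def reward_gated_filter(
    trajectories: List[Trajectory], threshold: float = 1.0
) -> List[Trajectory]:
    """Reject rollouts whose terminal reward is below the threshold.

    Assumption: when several rollouts of the same task pass the gate,
    keep the one with the fewest steps, as a rough proxy for token cost.
    """
    best: Dict[str, Trajectory] = {}
    for traj in trajectories:
        if traj.reward < threshold:
            continue  # the gate: failed rollouts never reach the training set
        current = best.get(traj.task_id)
        if current is None or len(traj.steps) < len(current.steps):
            best[traj.task_id] = traj
    return list(best.values())


# Toy rollouts; task names and steps are made up for the example.
samples = [
    Trajectory("heat-mug", (("find mug", "go to counter"),
                            ("heat it", "use microwave")), 1.0),
    Trajectory("heat-mug", (("look around", "go to shelf"),
                            ("find mug", "go to counter"),
                            ("heat it", "use microwave")), 1.0),
    Trajectory("clean-pan", (("find pan", "go to stove"),), 0.0),
]
kept = reward_gated_filter(samples)
```

In a flywheel of this kind, the surviving trajectories would become the fine-tuning data for the next round, and the updated model would generate the next batch of rollouts. Filtering on the terminal reward sidesteps per-step credit assignment, since no intermediate reward has to be estimated.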