AI · May 29

Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

arXiv:2605.30824 · 83.9 · h-index: 3

Predicted impact: top 30% in AI · last 90 days · Originality: Highly original

AI Analysis

This work provides a more effective training paradigm for LLMs performing complex, multi-branch deep research tasks, which is significant for researchers and users needing comprehensive, long-form answers.

This paper addresses the challenge of training LLMs for deep research tasks by proposing DecomposeR, a planner-centric framework that explicitly represents research plans as typed directed acyclic graphs (DAGs). A Qwen3-8B model is trained in two stages: planner RL first learns graph structure and query decomposition, and answerer RL then learns execution and synthesis. The resulting model improves over strong open baselines by 5.1-8.0 points on long-form benchmarks.

Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), making planning explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer RL then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.
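To make the idea of a research plan as a typed DAG concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not list the node types or the plan schema, so the names `PlanNode`, `NODE_TYPES`, `execution_order`, and the two types `search` and `synthesize` are all hypothetical. The sketch only shows how a plan can be validated structurally (known types, known dependencies, no cycles) and ordered so that each branch runs before the node that depends on it.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter

# Hypothetical node types: the abstract says the DAG is typed
# but does not enumerate the types.
NODE_TYPES = {"search", "synthesize"}


@dataclass
class PlanNode:
    node_id: str
    node_type: str
    query: str
    depends_on: list[str] = field(default_factory=list)


def execution_order(nodes: list[PlanNode]) -> list[str]:
    """Validate a plan and return node ids with dependencies first.

    Raises ValueError on duplicate ids, unknown types, unknown
    dependencies, or cycles.
    """
    by_id = {n.node_id: n for n in nodes}
    if len(by_id) != len(nodes):
        raise ValueError("duplicate node id in plan")
    for n in nodes:
        if n.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {n.node_type}")
        for dep in n.depends_on:
            if dep not in by_id:
                raise ValueError(f"unknown dependency: {dep}")
    # TopologicalSorter expects a mapping of node -> predecessors.
    sorter = TopologicalSorter({n.node_id: n.depends_on for n in nodes})
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        raise ValueError("plan contains a cycle") from exc


# A toy plan: two independent search branches feeding one synthesis node.
plan = [
    PlanNode("q1", "search", "What benchmarks evaluate long-form answers?"),
    PlanNode("q2", "search", "How are plans rewarded in RL training?"),
    PlanNode("final", "synthesize", "Combine findings", ["q1", "q2"]),
]
order = execution_order(plan)
```

Because the plan is an explicit object rather than part of a flat trajectory, structural checks like these can be computed on the plan alone; how DecomposeR actually turns structure into a reward is not specified in the abstract.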
