CL · Feb 23

How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

arXiv:2602.19526v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This work aims to make deep research agents more reliable for information retrieval and decision-making. The advance is incremental because it builds on prior methods.

The paper systematically studies the reinforcement learning components of deep research agents for knowledge-intensive tasks (prompt template, reward function, and policy optimization) and introduces a new baseline, Search-R1++, that improves on Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B).

Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM due to training collapse driven by answer avoidance; this can be mitigated by incorporating action-level penalties, ultimately surpassing EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
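
To make the reward comparison in the abstract concrete, here is a minimal Python sketch of an exact-match (EM) reward, a token-level F1 reward, and an F1 reward with action-level penalties. The normalization follows the common SQuAD-style convention, which is an assumption here. The function `penalized_f1_reward`, its parameters `no_answer_penalty` and `no_search_penalty`, and their default values are hypothetical illustrations; the abstract does not specify the penalties used in Search-R1++.

```python
import re
import string
from collections import Counter
from typing import Optional


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """EM reward: 1.0 only if the normalized answers are identical."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 reward: partial credit for overlapping tokens."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def penalized_f1_reward(
    prediction: Optional[str],
    gold: str,
    num_searches: int,
    no_answer_penalty: float = 0.5,   # hypothetical value
    no_search_penalty: float = 0.1,   # hypothetical value
) -> float:
    """F1 reward with hypothetical action-level penalties.

    A rollout that never emits a final answer (prediction is None) gets a
    negative reward, so avoiding the answer is never the cheapest strategy.
    """
    if prediction is None:
        return -no_answer_penalty
    reward = f1_score(prediction, gold)
    if num_searches == 0:
        reward -= no_search_penalty
    return reward
```

In this sketch, a rollout that never produces an answer scores strictly below any rollout that attempts one, which is one plausible way an action-level penalty could discourage answer avoidance. The actual penalty design in the paper may differ.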

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
