RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis
This work addresses the data bottleneck in developing robust agentic RAG systems, which matters for applications that require autonomous problem-solving in noisy environments. It offers a novel method for a known problem.
The paper tackles the problem of training robust agentic RAG systems when high-quality data reflecting real-world retrieval noise and complexity is scarce. It introduces RAGShaper, a data synthesis framework that automates task construction and agent trajectory generation. Models trained on the synthesized data significantly outperform existing baselines and are more robust on noise-intensive tasks.
Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.
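The idea of an information tree seeded with distractors, plus a navigation constraint that makes the teacher agent meet those distractors, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `InfoNode`, `constrained_trajectory`, the `accept`/`reject` step labels, and the rule "visit distractor children first" are hypothetical and are not taken from the RAGShaper implementation. The `"perception"` and `"cognition"` strings are used only as tags; the abstract does not define how distractors at each level are constructed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InfoNode:
    """One passage in a hypothetical information tree."""
    text: str
    # None for a clean passage; otherwise a tag such as "perception" or "cognition".
    distractor_level: Optional[str] = None
    children: List["InfoNode"] = field(default_factory=list)

    @property
    def is_distractor(self) -> bool:
        return self.distractor_level is not None


def constrained_trajectory(root: InfoNode) -> List[Tuple[str, str]]:
    """Depth-first walk in which distractor children are always visited first.

    This is a stand-in for a constrained navigation strategy: the walker cannot
    skip a distractor, so the recorded trajectory contains explicit rejections.
    """
    steps: List[Tuple[str, str]] = []

    def visit(node: InfoNode) -> None:
        if node.is_distractor:
            # The walker inspects the distractor, rejects it, and backtracks.
            steps.append(("reject", node.text))
            return
        steps.append(("accept", node.text))
        # Stable sort: distractors (key False) come before clean nodes (key True).
        for child in sorted(node.children, key=lambda c: not c.is_distractor):
            visit(child)

    visit(root)
    return steps


# Toy tree with one distractor at each level tag.
tree = InfoNode("seed", children=[
    InfoNode("clean A"),
    InfoNode("distractor P", distractor_level="perception"),
    InfoNode("clean B", children=[
        InfoNode("distractor C", distractor_level="cognition"),
    ]),
])

for action, text in constrained_trajectory(tree):
    print(action, text)
```

In this toy version the rejection is hard-coded from the node's label. In a real pipeline the teacher agent would have to recognize the distractor from its content, which is where the error-correction and noise-rejection behavior described in the abstract would come from.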