RAG-R1: Incentivizing the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
RAG-R1 addresses the latency and brittleness of single-query RAG methods for LLMs, with a focus on question-answering tasks.
The paper tackles the problem of LLMs generating hallucinated or outdated content by introducing RAG-R1, a two-stage training framework built around multi-query parallelism. The framework improves reasoning robustness and reduces inference latency, outperforming the strongest baseline by up to 13.7% and decreasing inference time by 11.1%.
Large Language Models (LLMs), despite their remarkable capabilities, are prone to generating hallucinated or outdated content due to their static internal knowledge. While Retrieval-Augmented Generation (RAG) integrated with Reinforcement Learning (RL) offers a solution, these methods are fundamentally constrained by a single-query mode, leading to prohibitive latency and inherent brittleness. To overcome these limitations, we introduce RAG-R1, a novel two-stage training framework centered around multi-query parallelism. Our framework enables LLMs to adaptively leverage internal and external knowledge during the reasoning process while transitioning from the single-query mode to multi-query parallelism. This architectural shift bolsters reasoning robustness while significantly reducing inference latency. Extensive experiments on seven question-answering benchmarks confirm the superiority of our method, which outperforms the strongest baseline by up to 13.7% and decreases inference time by 11.1%.
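The contrast between single-query mode and multi-query parallelism can be illustrated with a minimal sketch. This is not the paper's implementation: the toy corpus, the word-overlap `search` retriever, and the helper names `single_query_mode`, `multi_query_parallel`, and `merge` are all hypothetical, chosen only to show the idea of issuing several retrieval queries concurrently within one reasoning step instead of one after another.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory corpus standing in for an external retriever (hypothetical).
CORPUS = {
    "doc1": "Paris is the capital of France.",
    "doc2": "Berlin is the capital of Germany.",
    "doc3": "The Seine flows through Paris.",
}


def search(query: str, top_k: int = 2) -> list[str]:
    """Hypothetical retriever: rank documents by word overlap with the query."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in CORPUS.items():
        words = set(text.lower().rstrip(".").split())
        overlap = len(terms & words)
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def single_query_mode(queries: list[str]) -> list[list[str]]:
    """Issue one retrieval call at a time, each waiting for the previous one."""
    return [search(q) for q in queries]


def multi_query_parallel(queries: list[str]) -> list[list[str]]:
    """Issue all queries of one step concurrently; results keep query order."""
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        return list(pool.map(search, queries))


def merge(results: list[list[str]]) -> list[str]:
    """Deduplicate retrieved document ids, preserving first-seen order."""
    seen, merged = set(), []
    for hits in results:
        for doc_id in hits:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    queries = ["capital of France", "capital of Germany"]
    print(merge(multi_query_parallel(queries)))
```

Both modes return the same documents here; the difference is in how the calls are scheduled. With a real, I/O-bound retriever, the parallel variant replaces several sequential round trips with a single concurrent one, which is the kind of latency saving the abstract attributes to multi-query parallelism.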