CL May 15, 2025

DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation

Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao, Yongdong Zhang

arXiv:2505.10493v22.7 h-index: 85

Originality: Incremental advance

AI Analysis

This work addresses training-data problems in RAG systems, with the aim of improving LLM performance on open-domain QA. It is assessed as an incremental advance.

The paper targets two weaknesses in RAG training data: uneven document quality across queries, and low discriminability among the top-k documents retrieved for a single query. It introduces DACL-RAG, a multi-stage framework that combines data augmentation with curriculum learning, and reports gains of 2% to 4% over advanced methods on four open-domain QA datasets.

Retrieval-Augmented Generation (RAG) is an effective method to enhance the capabilities of large language models (LLMs). Existing methods typically optimize the retriever or the generator in a RAG system by directly using the top-k retrieved documents. However, two key issues inherent in the training data constrain the effectiveness of this training paradigm: (1) across different queries, the top-k retrieved documents vary greatly in content quality, with some providing valuable knowledge while others lack critical information or are even misleading, and training on such data in a purely random manner may impair the generator's ability to extract key information; (2) for a given query, the limited set of k documents often exhibits low discriminability, and training solely on them makes it difficult for the retriever to learn how to distinguish between relevant and irrelevant documents. To address these issues, we introduce DACL-RAG, a multi-stage RAG training framework that combines a multi-level Data Augmentation strategy with a multi-stage Curriculum Learning paradigm. The data augmentation strategy constructs comprehensive and diverse training sets with controllable difficulty levels through sample evolution, while the curriculum learning paradigm organizes them into progressive stages for training, ensuring stable and consistent improvements, thereby optimizing the overall performance and generalization of the RAG system more effectively. Our DACL-RAG framework demonstrates consistent effectiveness across four open-domain QA datasets, achieving performance gains of 2% to 4% over multiple advanced methods.
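The curriculum idea in the abstract (training sets with controllable difficulty, organized into progressive stages) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Sample` class, the integer `difficulty` label, and the `build_curriculum` and `train_in_stages` helpers are all hypothetical names, and the abstract does not say how difficulty is measured or how many stages are used. The sketch only shows the general pattern of grouping samples by difficulty and training on the groups from easiest to hardest.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    query: str
    documents: List[str]
    # Hypothetical difficulty label (0 = easiest). Assumed here to be
    # assigned during data augmentation; the abstract does not define it.
    difficulty: int


def build_curriculum(samples: List[Sample]) -> List[List[Sample]]:
    """Group samples into stages, ordered from easiest to hardest."""
    levels = sorted({s.difficulty for s in samples})
    return [[s for s in samples if s.difficulty == level] for level in levels]


def train_in_stages(
    stages: List[List[Sample]],
    train_step: Callable[[int, Sample], None],
) -> None:
    """Call the caller-supplied train_step on every sample, stage by stage."""
    for stage_index, stage in enumerate(stages):
        for sample in stage:
            train_step(stage_index, sample)


# Toy data: the document sets are placeholders, not real retrieval output.
samples = [
    Sample("q1", ["gold passage"], difficulty=0),
    Sample("q2", ["gold passage", "misleading passage"], difficulty=2),
    Sample("q3", ["gold passage", "off-topic passage"], difficulty=1),
]

stages = build_curriculum(samples)

# Stand-in for a real update step: just record the order of visits.
seen = []
train_in_stages(stages, lambda i, s: seen.append((i, s.query)))
```

In this toy run the samples are visited in the order q1, q3, q2, that is, by increasing difficulty rather than in the random order the abstract argues against. A real system would replace the recording lambda with an update to the retriever or the generator.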
