CL AI IR LG
Oct 23, 2024

SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, Qi He

arXiv:2410.17952v216.839 citations | h-index: 12 | NAACL

Originality: Incremental advance

AI Analysis

This work addresses the problem of domain adaptation for RAG systems in specialized fields, offering an incremental improvement through self-generated synthetic data.

The paper tackles the challenge of adapting retrieval-augmented generation (RAG) systems to specialized domains such as science and medicine. It proposes SimRAG, a self-training method in which the LLM itself generates and filters domain-relevant questions, and it reports improvements of 1.2% to 8.6% over baselines across 11 datasets.

Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve its performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2%–8.6%.
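
The generate-then-filter step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the filtering criterion, so this sketch keeps a question only when a toy token-overlap retriever ranks the question's source passage in its top k. The names `overlap_score`, `retrieve`, `build_synthetic_set`, and `stub_generator` are hypothetical, and `stub_generator` merely stands in for prompting the fine-tuned LLM.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: number of distinct lowercase tokens shared."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, corpus: List[str], k: int) -> List[int]:
    """Return indices of up to k passages with the highest positive toy score."""
    candidates = [i for i in range(len(corpus)) if overlap_score(query, corpus[i]) > 0]
    candidates.sort(key=lambda i: overlap_score(query, corpus[i]), reverse=True)
    return candidates[:k]


def build_synthetic_set(
    corpus: List[str],
    generate_question: Callable[[str], str],
    k: int = 1,
) -> List[Tuple[str, str]]:
    """Generate one question per passage and keep it only if it passes the filter.

    Assumed filter (not taken from the abstract): the source passage must be
    among the top-k passages retrieved for the generated question.
    """
    kept = []
    for idx, passage in enumerate(corpus):
        question = generate_question(passage)
        if idx in retrieve(question, corpus, k):
            kept.append((question, passage))
    return kept


def stub_generator(passage: str) -> str:
    """Hypothetical placeholder for prompting the fine-tuned LLM."""
    return "What is known about " + " ".join(passage.split()[:3]) + "?"


corpus = [
    "Aspirin inhibits platelet aggregation in patients",
    "Transformers use self-attention over token sequences",
]
pairs = build_synthetic_set(corpus, stub_generator, k=1)
```

In this sketch, a question that shares no tokens with its source passage retrieves nothing and is discarded, which is the role a quality filter would play before the retained pairs are used for further fine-tuning.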
