CL · Sep 4, 2023

Benchmarking Large Language Models in Retrieval-Augmented Generation

arXiv:2309.01431v2 · 548 citations
Originality: Incremental advance
AI Analysis

This work addresses the lack of rigorous evaluation for retrieval-augmented generation (RAG) in LLMs, providing a benchmark to identify bottlenecks for researchers and practitioners in natural language processing.

The paper benchmarks six large language models (LLMs) on a new Retrieval-Augmented Generation Benchmark (RGB), evaluating noise robustness, negative rejection, information integration, and counterfactual robustness. The models show some noise robustness but struggle significantly in the other three areas.

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
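To make two of the four abilities concrete, here is a minimal sketch of how a noise-robustness or negative-rejection test case might be assembled and scored. It is an illustration under assumptions, not the benchmark's actual code: the `Instance` fields, the `build_context` helper, the substring-match scoring, and the `REFUSAL_MARKER` string are all hypothetical names and choices that do not come from the abstract.

```python
import random
from dataclasses import dataclass

# Hypothetical refusal phrase; the wording a real benchmark expects may differ.
REFUSAL_MARKER = "insufficient information"


@dataclass
class Instance:
    """One hypothetical test case: a question, its answer, and candidate documents."""
    question: str
    answer: str
    positive_docs: list[str]  # documents that contain the answer
    negative_docs: list[str]  # topically related documents without the answer


def build_context(inst: Instance, total_docs: int, noise_ratio: float,
                  rng: random.Random) -> list[str]:
    """Mix answer-bearing and noisy documents at the requested ratio.

    noise_ratio=1.0 yields only noisy documents, which is the setting a
    negative-rejection test would use (the model should decline to answer).
    """
    n_noise = round(total_docs * noise_ratio)
    n_pos = total_docs - n_noise
    docs = inst.positive_docs[:n_pos] + inst.negative_docs[:n_noise]
    rng.shuffle(docs)  # avoid a fixed position for the useful documents
    return docs


def is_correct(response: str, inst: Instance) -> bool:
    """Assumed scoring rule: the response contains the reference answer."""
    return inst.answer.lower() in response.lower()


def is_rejection(response: str) -> bool:
    """Assumed scoring rule: the response contains the refusal phrase."""
    return REFUSAL_MARKER in response.lower()


def accuracy(responses: list[str], instances: list[Instance]) -> float:
    """Fraction of responses that contain their reference answer."""
    hits = sum(is_correct(r, i) for r, i in zip(responses, instances))
    return hits / len(instances)


def rejection_rate(responses: list[str]) -> float:
    """Fraction of responses that decline to answer."""
    return sum(is_rejection(r) for r in responses) / len(responses)
```

In this sketch, sweeping `noise_ratio` from 0 toward 1 and tracking `accuracy` would probe noise robustness, while fixing it at 1.0 and tracking `rejection_rate` would probe negative rejection. Information integration and counterfactual robustness would need different instance shapes (multi-part answers, deliberately falsified documents) and are not covered here.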

Code Implementations: 1 repo
