CL | AI | May 28, 2025

Evaluating the Retrieval Robustness of Large Language Models

arXiv:2505.21870v1 | 2 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the practical reliability of RAG for users of LLMs in knowledge-intensive tasks, though it is incremental in nature.

The study evaluated the robustness of large language models in retrieval-augmented generation setups. All 11 models tested showed high robustness, but the remaining gaps in robustness still kept them from fully benefiting from RAG.

Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.
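The abstract names three robustness metrics but does not define them here. As a rough illustration only, the sketch below shows one way such metrics could be computed from per-question correctness labels. The function names, the input layout, and the formulas are all assumptions made for this example, not the paper's definitions.

```python
from typing import Sequence


def no_degradation_rate(rag_correct: Sequence[bool],
                        base_correct: Sequence[bool]) -> float:
    """Hypothetical metric for RQ1: share of questions where the answer
    with retrieval is at least as good as the answer without it."""
    if not rag_correct or len(rag_correct) != len(base_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    kept = sum(r >= b for r, b in zip(rag_correct, base_correct))
    return kept / len(rag_correct)


def size_robustness(correct_by_k: Sequence[Sequence[bool]]) -> float:
    """Hypothetical metric for RQ2: each inner sequence holds one question's
    correctness as the number of retrieved documents grows. Returns the share
    of adjacent steps where adding documents does not hurt."""
    steps = [(prev, nxt)
             for per_question in correct_by_k
             for prev, nxt in zip(per_question, per_question[1:])]
    if not steps:
        raise ValueError("need at least two document counts per question")
    return sum(nxt >= prev for prev, nxt in steps) / len(steps)


def order_robustness(correct_by_order: Sequence[Sequence[bool]]) -> float:
    """Hypothetical metric for RQ3: each inner sequence holds one question's
    correctness under different document orderings. Returns the share of
    questions whose outcome is identical across all orderings."""
    if not correct_by_order:
        raise ValueError("need at least one question")
    stable = sum(len(set(outcomes)) == 1 for outcomes in correct_by_order)
    return stable / len(correct_by_order)


if __name__ == "__main__":
    # Toy labels, invented for illustration; not data from the paper.
    print(no_degradation_rate([True, False, True, True],
                              [True, True, False, False]))
    print(size_robustness([[True, True, False], [False, True, True]]))
    print(order_robustness([[True, True], [True, False], [False, False]]))
```

Under these assumed definitions, a score of 1.0 on all three would mean retrieval never hurts, more documents never hurt, and document order never changes the outcome.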
