RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
This work addresses the need for more realistic robustness testing of RAG systems, which matters for applications such as finance and policy. The contribution is incremental, since it builds on existing evaluation methods.
The paper tackles the problem of evaluating how robust Retrieval-Augmented Generation (RAG) systems are to real-world noise and dynamic data. It introduces the RARE framework and benchmark, which reveal that RAG systems are unexpectedly sensitive to perturbations and are less robust on multi-hop queries than on single-hop queries across domains.
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal knowledge and externally retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from a customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level, time-sensitive finance, economics, and policy documents and 48,295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model's ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.
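To make the idea of "remaining correct or recovering" under perturbation concrete, the following is a minimal sketch of how such a score could be computed from evaluation records. It is not the paper's RARE-Met definition, which the abstract does not spell out. The `Trial` record, its field names, and the two rates returned by `robustness_scores` are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Trial:
    """One (question, perturbation) evaluation record (hypothetical schema)."""
    question_id: str
    correct_original: bool   # answer was correct with unperturbed query and documents
    correct_perturbed: bool  # answer was correct after the perturbation was applied


def robustness_scores(trials: List[Trial]) -> Tuple[Optional[float], Optional[float]]:
    """Return (stay_correct_rate, recovery_rate).

    stay_correct_rate: among trials answered correctly before perturbation,
        the fraction still answered correctly afterwards.
    recovery_rate: among trials answered incorrectly before perturbation,
        the fraction answered correctly afterwards.
    A rate is None when its denominator is empty.
    """
    was_right = [t for t in trials if t.correct_original]
    was_wrong = [t for t in trials if not t.correct_original]

    stay = (
        sum(t.correct_perturbed for t in was_right) / len(was_right)
        if was_right else None
    )
    recover = (
        sum(t.correct_perturbed for t in was_wrong) / len(was_wrong)
        if was_wrong else None
    )
    return stay, recover


# Illustrative records only; these are not results from the paper.
example_trials = [
    Trial("q1", True, True),
    Trial("q1", True, False),
    Trial("q2", False, True),
    Trial("q3", False, False),
    Trial("q4", True, True),
]
stay_rate, recovery_rate = robustness_scores(example_trials)
```

Reporting the two rates separately keeps "staying correct" distinct from "recovering"; computing them per group (for example, single-hop versus multi-hop questions) would allow the kind of comparison the abstract describes.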