CL | AI | Aug 15, 2025

Can we Evaluate RAGs with Synthetic Data?

arXiv:2508.11758v2 | 2 citations | h-index: 17
Originality: Incremental advance
AI Analysis

The paper addresses the scarcity of benchmarks for RAG evaluation. Its contribution is incremental: it mainly identifies where synthetic data falls short as a substitute for human-labeled benchmarks.

The paper investigates whether LLM-generated synthetic QA data can replace human-labeled benchmarks for evaluating RAG systems. It finds synthetic data reliable for ranking retriever configurations but inconsistent for ranking generator architectures.

We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when the latter are unavailable. We assess the reliability of synthetic benchmarks in two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator while keeping retriever parameters fixed. Across four datasets, two open-domain and two proprietary, we find that synthetic benchmarks reliably rank RAGs that differ in retriever configuration, aligning well with human-labeled benchmark baselines. However, they do not consistently produce reliable RAG rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks and stylistic bias favoring certain generators.
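
One way to make "aligning well with human-labeled baselines" concrete is to compare the ranking of RAG configurations under each benchmark with a rank correlation. The sketch below uses Kendall's tau (tau-a, no tie correction) as an illustrative agreement measure; the abstract does not state which statistic the authors use, so this choice is an assumption. The system names and scores are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists for the same systems.

    Returns a value in [-1, 1]: 1 means identical rankings,
    -1 means fully reversed rankings. Tied pairs count as neither
    concordant nor discordant (no tie correction is applied).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both benchmarks order the pair the same way.
        direction = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1

    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores for four RAG configurations (illustrative only).
human_scores = {"rag_a": 0.62, "rag_b": 0.55, "rag_c": 0.71, "rag_d": 0.48}
synthetic_scores = {"rag_a": 0.80, "rag_b": 0.74, "rag_c": 0.85, "rag_d": 0.70}

systems = sorted(human_scores)
tau = kendall_tau(
    [human_scores[s] for s in systems],
    [synthetic_scores[s] for s in systems],
)
print(f"Kendall tau between human and synthetic rankings: {tau:.2f}")
```

In this toy example the absolute scores differ between benchmarks but the ordering of systems is the same, so the rankings agree perfectly. A synthetic benchmark can therefore be a useful proxy for ranking even when its absolute scores are not comparable to those from human-labeled data.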
