CL · Jul 23, 2025

PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu

arXiv:2507.22927v1 · 2.73 citations · h-index: 11 · Has Code

Originality: Incremental advance

AI Analysis

The paper provides a systematic framework for benchmarking LLMs within RAG systems, addressing a domain-specific need for more reliable and efficient AI applications. The advance is incremental, as it builds on existing RAG evaluation methods.

The paper tackles the lack of granular evaluation of LLM-specific capabilities in Retrieval-Augmented Generation (RAG) systems. It introduces the PRGB benchmark, which uses a placeholder-based approach to assess multi-level filtering, combination, and reference reasoning. The results reveal limitations in the error resilience and context faithfulness of representative LLMs.

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available at https://github.com/Alipay-Med/PRGB.
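The abstract does not spell out how the placeholder mechanism works, so the following is only a minimal sketch of the general idea, not the PRGB implementation. It assumes that the gold answer in the retrieved documents is swapped for a synthetic token, so a model can only answer correctly by reading the context rather than recalling the fact from its parameters. The names `Sample`, `substitute_placeholder`, and `is_context_faithful`, and the placeholder value `Zorvath`, are hypothetical and do not come from the paper or its repository.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical benchmark item: a query, retrieved documents, and the gold answer."""
    query: str
    documents: list[str]
    answer: str


def substitute_placeholder(sample: Sample, placeholder: str) -> Sample:
    """Replace every occurrence of the gold answer in the documents with a synthetic token.

    Assumption: the answer appears verbatim in the documents. The returned
    sample uses the placeholder as its new gold answer.
    """
    pattern = re.compile(re.escape(sample.answer))
    new_docs = [pattern.sub(placeholder, doc) for doc in sample.documents]
    return Sample(sample.query, new_docs, placeholder)


def is_context_faithful(response: str, sample: Sample) -> bool:
    """Treat a response as faithful if it contains the (placeholder) answer from the context."""
    return sample.answer in response


original = Sample(
    query="Where is Acme headquartered?",
    documents=["Acme is headquartered in Paris.", "Paris is a large city."],
    answer="Paris",
)
masked = substitute_placeholder(original, "Zorvath")
```

Under these assumptions, a model that answers "Paris" for the masked sample is relying on parametric knowledge, while one that answers "Zorvath" is following the supplied documents. A simple substring check like this is only a stand-in for whatever scoring the benchmark actually uses.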
