AI AP Jul 29, 2025

Towards a rigorous evaluation of RAG systems: the challenge of due diligence

Grégoire Martinon, Alexandra Lorenzo de Brionne, Jérôme Bohard, Antoine Lojou, Damien Hervault, Nicolas J-B. Brunel

arXiv:2507.21753v1 1 citation h-index: 3

Originality Incremental advance

AI Analysis

This work addresses the challenge of ensuring reliable RAG systems for critical industrial applications such as due diligence in investment funds, though its improvement to evaluation protocols is incremental.

The study addresses the problem of evaluating the reliability of Retrieval-Augmented Generation (RAG) systems in high-risk sectors such as finance. It proposes a robust evaluation protocol that combines human and LLM-Judge annotations to identify failures such as hallucinations, and it obtains precise performance measurements with statistical guarantees.

The rise of generative AI has driven significant advancements in high-risk sectors like healthcare and finance. The Retrieval-Augmented Generation (RAG) architecture, which combines large language models (LLMs) with search engines, is particularly notable for its ability to generate responses from document corpora. Despite its potential, the reliability of RAG systems in critical contexts remains a concern, with issues such as hallucinations persisting. This study evaluates a RAG system used in due diligence for an investment fund. We propose a robust evaluation protocol combining human annotations and LLM-Judge annotations to identify system failures such as hallucinations, off-topic responses, failed citations, and abstentions. Inspired by the Prediction Powered Inference (PPI) method, we achieve precise performance measurements with statistical guarantees. We provide a comprehensive dataset for further analysis. Our contributions aim to enhance the reliability and scalability of RAG system evaluation protocols in industrial applications.
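
To make the PPI idea concrete, here is a minimal sketch of the classical prediction-powered mean estimator, not the paper's exact protocol (the abstract only says the method is "inspired by" PPI). It assumes a small set of answers annotated by both humans and an LLM judge, plus a larger set annotated by the judge alone. The function name `ppi_mean_ci` and its parameters are hypothetical, chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(human_labeled, judge_on_labeled, judge_on_unlabeled, alpha=0.05):
    """Prediction-powered estimate of a mean rate (e.g. hallucination rate).

    human_labeled      -- human annotations (0/1) on the small labeled set
    judge_on_labeled   -- LLM-judge annotations on the same labeled set
    judge_on_unlabeled -- LLM-judge annotations on the large judge-only set
    Returns (estimate, (lower, upper)) with a normal-approximation interval.
    """
    if len(human_labeled) != len(judge_on_labeled):
        raise ValueError("labeled sets must have the same length")
    n, big_n = len(human_labeled), len(judge_on_unlabeled)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two items in each set")

    # Rectifier: how far the judge is from the human on the labeled set.
    rectifier = [y - f for y, f in zip(human_labeled, judge_on_labeled)]

    # Judge-only mean, corrected by the average judge error.
    estimate = fmean(judge_on_unlabeled) + fmean(rectifier)

    # The two terms come from independent samples, so their variances add.
    std_err = sqrt(variance(judge_on_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper's dataset.
    est, (lo, hi) = ppi_mean_ci([1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 0, 0, 1, 0, 1, 1])
    print(f"estimate={est:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The design point is that the judge's annotations supply volume while the human annotations correct the judge's bias. When the judge agrees closely with humans, the rectifier has low variance and the interval is tighter than one built from the human labels alone.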
