LG · Mar 27, 2025

Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?

arXiv:2503.21157v3 · 3 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of automatically detecting hallucinations in RAG systems, which is crucial for improving reliability in AI applications. Its contribution is incremental, however, because it focuses on benchmarking existing methods.

This paper surveys evaluation models for detecting hallucinations in Retrieval-Augmented Generation (RAG) and benchmarks their performance across six applications, finding that some approaches achieve high precision and recall in detecting incorrect responses.

This article surveys evaluation models that automatically detect hallucinations in Retrieval-Augmented Generation (RAG) and presents a comprehensive benchmark of their performance across six RAG applications. The methods in our study are LLM-as-a-Judge, Prometheus, Lynx, the Hughes Hallucination Evaluation Model (HHEM), and the Trustworthy Language Model (TLM). All of these approaches are reference-free: they require no ground-truth answers or labels to catch incorrect LLM responses. Our study reveals that, across diverse RAG applications, some of these approaches consistently detect incorrect RAG responses with high precision and recall.
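To make the precision/recall framing concrete, here is a minimal sketch of how a reference-free detector could be scored once human labels are available for evaluation. It assumes the detector emits a score in [0, 1] where lower means "more likely incorrect", and that a response is flagged when its score falls below a threshold. The function name, the threshold, and the sample data are hypothetical illustrations, not values or code from the paper.

```python
from typing import Sequence, Tuple


def precision_recall(
    scores: Sequence[float],
    is_incorrect: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall for flagging incorrect RAG responses.

    Assumption: a response is flagged as a hallucination when its
    detector score is strictly below `threshold`.
    """
    if len(scores) != len(is_incorrect):
        raise ValueError("scores and labels must have the same length")

    flagged = [s < threshold for s in scores]
    # True positive: flagged and actually incorrect.
    tp = sum(1 for f, y in zip(flagged, is_incorrect) if f and y)
    # False positive: flagged but actually correct.
    fp = sum(1 for f, y in zip(flagged, is_incorrect) if f and not y)
    # False negative: not flagged but actually incorrect.
    fn = sum(1 for f, y in zip(flagged, is_incorrect) if not f and y)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up scores and labels, for illustration only.
example_scores = [0.95, 0.20, 0.45, 0.10, 0.85, 0.70]
example_labels = [False, True, False, True, False, True]
p, r = precision_recall(example_scores, example_labels, threshold=0.5)
```

Sweeping the threshold over the score range traces out a precision/recall curve, which allows detectors with differently calibrated scores to be compared on equal footing.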

