AI | Oct 10, 2025

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian

arXiv:2510.08945v23.3 | h-index: 7 | Has Code

Originality: Incremental advance

AI Analysis

FATHOMS-RAG gives researchers and practitioners a new benchmark for evaluating multimodal RAG systems, though the advance is incremental because it builds on existing RAG evaluation efforts.

The authors address the evaluation of multimodal retrieval-augmented generation (RAG) pipelines by introducing a benchmark that assesses ingestion, retrieval, and reasoning across text, tables, and images. They find that closed-source pipelines significantly outperform open-source ones on correctness and hallucination metrics, with wider gaps for multimodal and cross-document questions.

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
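To make the two metrics named in the abstract more concrete, here is a minimal Python sketch of what a phrase-level recall score and a nearest-neighbor embedding classifier could look like. It is not the authors' implementation: the function names, the normalization rule, the label set ("abstention" vs. "answer"), and the toy bag-of-words `embed` function (a stand-in for a real sentence-embedding model) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def phrase_recall(answer: str, expected_phrases: list[str]) -> float:
    """Fraction of expected phrases found in the answer (assumed definition)."""
    if not expected_phrases:
        return 0.0
    # Pad with spaces so phrases only match on whole-word boundaries.
    padded = f" {normalize(answer)} "
    hits = 0
    for phrase in expected_phrases:
        norm = normalize(phrase)
        if norm and f" {norm} " in padded:
            hits += 1
    return hits / len(expected_phrases)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; hypothetical stand-in for a real embedding model."""
    return Counter(normalize(text).split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(answer: str, labeled_examples: list[tuple[str, str]]) -> str:
    """Return the label of the labeled example most similar to the answer."""
    query = embed(answer)
    best_text, best_label = max(
        labeled_examples, key=lambda ex: cosine(query, embed(ex[0]))
    )
    return best_label


if __name__ == "__main__":
    # Illustrative inputs only; none of these come from the FATHOMS-RAG dataset.
    answer = "The 2019 budget was $4.2 million, approved in March."
    print(phrase_recall(answer, ["4.2 million", "March", "2020"]))

    examples = [
        ("I do not know", "abstention"),
        ("The answer is 42", "answer"),
    ]
    print(nearest_label("I do not know the answer", examples))
```

Under these assumptions, a response that scores low on phrase recall and is not classified as an abstention would be a candidate hallucination; how the paper actually combines the two signals is not specified in the abstract.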
