IR AI CL | Sep 30, 2025

Auto-ARGUE: LLM-Based Report Generation Evaluation

William Walden, Marc Mason, Orion Weller, Laura Dietz, John Conroy, Neil Molino, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, Dawn Lawrie, James Mayfield

arXiv:2509.26184v419.115 citations | h-index: 20 | Has Code

Originality: Synthesis-oriented

AI Analysis

Auto-ARGUE gives researchers and practitioners a tool for evaluating report generation systems, but the contribution is incremental because it implements an existing framework.

The paper addresses the lack of evaluation tools for report generation in RAG systems by introducing Auto-ARGUE, an LLM-based implementation of the ARGUE framework, and reports good system-level correlations with human judgments on the report generation pilot task of the TREC 2024 NeuCLIR track.

Generation of long-form, citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation. We present an analysis of Auto-ARGUE on the RG pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.
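To make "system-level correlation with human judgments" concrete, here is a minimal sketch of how such a number is commonly computed: each system's per-topic scores are averaged into one score, and the resulting automatic and human score lists are compared with a rank correlation. The abstract does not say which statistic the authors use, so Kendall's tau (tau-a variant) is an assumption here, and all function names, system names, and scores are hypothetical illustrations, not Auto-ARGUE's API or results.

```python
from itertools import combinations
from statistics import mean


def system_scores(per_topic_scores):
    """Average each system's per-topic scores into one system-level score."""
    return {system: mean(scores) for system, scores in per_topic_scores.items()}


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def system_level_correlation(auto_per_topic, human_per_topic):
    """Correlate system-level automatic scores with system-level human scores."""
    auto = system_scores(auto_per_topic)
    human = system_scores(human_per_topic)
    systems = sorted(auto.keys() & human.keys())  # only systems scored by both
    return kendall_tau([auto[s] for s in systems], [human[s] for s in systems])


# Made-up scores for four hypothetical systems over three topics each.
auto_per_topic = {
    "sys_a": [0.62, 0.58, 0.70],
    "sys_b": [0.41, 0.45, 0.39],
    "sys_c": [0.55, 0.50, 0.52],
    "sys_d": [0.30, 0.36, 0.33],
}
human_per_topic = {
    "sys_a": [0.66, 0.60, 0.71],
    "sys_b": [0.48, 0.50, 0.52],
    "sys_c": [0.47, 0.44, 0.49],
    "sys_d": [0.25, 0.31, 0.28],
}

tau = system_level_correlation(auto_per_topic, human_per_topic)
print(f"system-level Kendall tau: {tau:.3f}")
```

In this toy data the automatic metric and the human judgments disagree only on the relative order of sys_b and sys_c, so five of the six system pairs are concordant. A system-level comparison like this asks whether the metric ranks systems the way humans do, which is a weaker requirement than agreeing with humans on every individual report.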

