IR · AI · CL · Sep 30, 2025

Auto-ARGUE: LLM-Based Report Generation Evaluation

arXiv:2509.26184v415 citations · h-index: 20 · Has Code
Originality: Synthesis-oriented
AI Analysis

Auto-ARGUE gives researchers and practitioners a tool for evaluating report generation systems, but the contribution is incremental because it implements an existing framework (ARGUE) rather than proposing a new one.

The paper addresses the lack of evaluation tools for report generation in RAG systems by introducing Auto-ARGUE, an LLM-based implementation of the ARGUE framework, and reports good system-level correlations with human judgments on the report generation pilot task of the TREC 2024 NeuCLIR track.

Generation of long-form, citation-backed reports is a primary use case for retrieval augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation. We present analysis of Auto-ARGUE on the RG pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.
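To make "system-level correlation" concrete, here is a minimal sketch of how such a number could be computed: average each system's per-topic scores, then compare the ordering of systems under the automatic metric with their ordering under human judgments. The statistic (Kendall's tau), the function names, and all scores below are assumptions for illustration only; the abstract does not say which correlation measure the authors use, and none of this is Auto-ARGUE's actual API or data.

```python
from itertools import combinations


def system_level_scores(per_topic):
    """Average each system's per-topic scores into one system-level score."""
    return {system: sum(scores) / len(scores) for system, scores in per_topic.items()}


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: both metrics order the pair the same way.
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy scores: four made-up systems, two topics each.
auto_per_topic = {
    "sys_a": [0.60, 0.70],
    "sys_b": [0.40, 0.50],
    "sys_c": [0.80, 0.90],
    "sys_d": [0.20, 0.30],
}
human_per_topic = {
    "sys_a": [0.50, 0.70],
    "sys_b": [0.60, 0.70],
    "sys_c": [0.85, 0.95],
    "sys_d": [0.15, 0.25],
}

auto = system_level_scores(auto_per_topic)
human = system_level_scores(human_per_topic)
systems = sorted(auto)  # fixed order so both lists line up system by system

tau = kendall_tau([auto[s] for s in systems], [human[s] for s in systems])
print(f"system-level Kendall's tau: {tau:.3f}")
```

In this toy data the two metrics disagree only on the relative order of sys_a and sys_b, so the correlation is high but not perfect. A real analysis would more likely use a library routine with tie handling, such as `scipy.stats.kendalltau`.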

