Auto-ARGUE: LLM-Based Report Generation Evaluation
This work gives researchers and practitioners a tool for evaluating report generation systems, but the contribution is incremental because it implements an existing framework.
The paper addresses the lack of evaluation tools for report generation in RAG systems by introducing Auto-ARGUE, an LLM-based implementation of the ARGUE framework, and reports good system-level correlations with human judgments on the RG pilot task of the TREC 2024 NeuCLIR track.
Generation of long-form, citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation. We present an analysis of Auto-ARGUE on the RG pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualization of Auto-ARGUE outputs.
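To make "system-level correlation" concrete, here is a minimal sketch of how such a comparison is commonly computed: per-topic scores are averaged into one score per system, and the rankings induced by the automatic and human scores are compared. The abstract does not say which correlation statistic was used, so Kendall's tau (tau-a) is an assumption here. The system names, the scores, and the helper functions `system_scores` and `kendall_tau` are all hypothetical and are not taken from the paper or from Auto-ARGUE.

```python
from itertools import combinations
from statistics import mean


def system_scores(per_topic_scores):
    """Average each system's per-topic scores into one system-level score."""
    return {system: mean(scores) for system, scores in per_topic_scores.items()}


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Tied pairs (product == 0) count as neither in tau-a.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-topic scores for four made-up systems (NOT from the paper).
auto_scores = {
    "sys_a": [0.62, 0.70, 0.66],
    "sys_b": [0.41, 0.55, 0.48],
    "sys_c": [0.30, 0.35, 0.28],
    "sys_d": [0.50, 0.52, 0.45],
}
human_scores = {
    "sys_a": [0.70, 0.75, 0.68],
    "sys_b": [0.52, 0.50, 0.47],
    "sys_c": [0.25, 0.30, 0.35],
    "sys_d": [0.45, 0.48, 0.42],
}

auto_sys = system_scores(auto_scores)
human_sys = system_scores(human_scores)
systems = sorted(auto_sys)  # fixed order so both score lists line up
tau = kendall_tau(
    [auto_sys[s] for s in systems],
    [human_sys[s] for s in systems],
)
print(f"Kendall's tau (system level): {tau:.3f}")
```

A value near 1 means the automatic metric ranks systems almost the same way human judges do, even if the absolute scores differ. Tau-a ignores ties; with many tied systems, a tie-corrected variant would be a more suitable choice.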