CL · AI · Dec 2, 2025

SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys

arXiv:2512.02763v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses a problem that matters to researchers and developers: how to evaluate complex LLM-based survey-generation systems. The advance is incremental, since it builds on existing evaluation frameworks.

The paper introduces SurveyEval, a benchmark that scores LLM-generated academic surveys on overall quality, outline coherence, and reference accuracy across 7 subjects. Its results show that specialized survey-generation systems produce substantially higher-quality surveys than general long-text or paper-writing systems.

LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
