CL · AI · Oct 28, 2025

LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

arXiv:2510.24345v1 · 2 citations · h-index: 17 · EMNLP
AI Analysis

This paper addresses the problem of evaluating long-form generation, which matters to AI researchers and developers, though its contribution is incremental because it builds on existing benchmark approaches.

The paper addresses the difficulty of evaluating long, informative, and factual outputs from Large Language Models by introducing LongWeave, a benchmark that balances real-world relevance with verifiability. Its results show that even state-of-the-art models struggle as task complexity and output length increase.

Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
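
The target-first construction described in the abstract can be illustrated with a toy sketch. This is not the LongWeave or CoV-Eval implementation. The `Constraint` class, the `build_task` and `score` functions, the substring and word-count checks, and the fraction-satisfied scoring rule are all hypothetical, chosen only to show how starting from verifiable targets yields constraints that can be checked objectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """One verifiable requirement on a generated text (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]  # returns True when the output satisfies it


def build_task(targets: List[str], min_words: int, max_words: int) -> List[Constraint]:
    """Start from verifiable targets, then derive the constraints to check."""
    constraints = [
        # Bind `t` as a default argument so each lambda keeps its own target.
        Constraint(f"mentions:{t}", lambda out, t=t: t.lower() in out.lower())
        for t in targets
    ]
    # Assumed length constraint, measured in whitespace-separated words.
    constraints.append(
        Constraint("length", lambda out: min_words <= len(out.split()) <= max_words)
    )
    return constraints


def score(output: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints satisfied (an assumed scoring rule)."""
    passed = sum(1 for c in constraints if c.check(output))
    return passed / len(constraints)
```

With targets `["Paris", "1889"]` and a 5 to 20 word limit, an output that mentions both targets within the limit satisfies all three constraints, while one that omits "1889" satisfies two of three. A real benchmark would presumably use far richer verifiers than substring matching.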
