CL | AI | LG | May 25, 2025

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

arXiv:2505.19293v1 | 4 citations | h-index: 11 | ACL
Originality: Incremental advance
AI Analysis

This work addresses a methodological problem for researchers and developers evaluating long-context capabilities in LLMs, though it is incremental as it builds on existing benchmarks.

The paper identifies two problems with existing long-context benchmarks for LLMs: they lack metrics that separate long-context performance from baseline ability, and they use fixed input lengths, which limits cross-model comparison and applicability. It introduces a length-controllable benchmark and a new metric to address both issues, and reports experiments in which this approach evaluates LLMs more effectively than existing benchmarks.

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
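To make the idea of separating baseline ability from long-context ability concrete, here is a minimal sketch in Python. The abstract does not state the metric's formula, so everything here is an assumption for illustration: the function name `relative_long_context_scores`, the choice to average short-context scores as the baseline, and the example numbers are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_long_context_scores(scores_by_length, base_lengths):
    """Express each longer-context score as a relative change from a baseline.

    ASSUMPTION (not from the paper): the baseline is the mean score at the
    short lengths in `base_lengths`, and long-context ability at length n is
    (score_n - baseline) / baseline.
    """
    base = mean(scores_by_length[n] for n in base_lengths)
    if base == 0:
        raise ValueError("baseline score is zero; relative change is undefined")
    return {
        n: (score - base) / base
        for n, score in sorted(scores_by_length.items())
        if n not in base_lengths
    }


# Hypothetical task scores keyed by input length in tokens.
model_a = {2000: 0.80, 4000: 0.80, 8000: 0.60, 16000: 0.40}
model_b = {2000: 0.40, 4000: 0.40, 8000: 0.38, 16000: 0.36}

rel_a = relative_long_context_scores(model_a, base_lengths=(2000, 4000))
rel_b = relative_long_context_scores(model_b, base_lengths=(2000, 4000))

for length in rel_a:
    print(length, round(rel_a[length], 3), round(rel_b[length], 3))
```

In this made-up example, model A has the higher raw score at 16,000 tokens, yet it loses half of its short-context performance while model B loses about a tenth. A raw-score comparison would hide that difference, which is the kind of confound the abstract describes.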

Code Implementations: 1 repo
