CL · Feb 19, 2025

SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning

arXiv:2502.13753v1 · 1 citation · h-index: 13
Originality: Incremental advance
AI Analysis

SCALAR gives the AI research community a reliable, sustainable benchmark for tracking progress in long-context understanding, though the advance is incremental because it builds on existing evaluation methods.

The authors address the difficulty of evaluating long-context understanding in large language models by introducing SCALAR, a benchmark built from academic papers and their citation networks. SCALAR generates its labels automatically, and the authors use it to evaluate 8 state-of-the-art models on ICLR 2025 papers, revealing their capabilities and limitations.

Evaluating large language models' (LLMs) long-context understanding capabilities remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate 8 state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.
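As an illustration of how ground-truth labels could be derived from citations without human annotation, the sketch below masks one citation marker in a passage and asks which reference belongs there. This is an assumption made for illustration, not the procedure confirmed by the abstract: the cloze-style task format, the function name `make_cloze_item`, the `[MASKED_CITATION]` token, and the use of the distractor count as a difficulty knob are all hypothetical.

```python
import random


def make_cloze_item(passage, cited_key, references, num_distractors=3, seed=0):
    """Build one multiple-choice item by masking a citation marker.

    Hypothetical format: citations appear in the passage as "[key]" and
    `references` maps each key to a reference title. The label comes for
    free, because the masked key is known from the paper itself.
    """
    marker = f"[{cited_key}]"
    if marker not in passage:
        raise ValueError(f"citation marker {marker} not found in passage")

    # Hide only the first occurrence of the target citation.
    masked = passage.replace(marker, "[MASKED_CITATION]", 1)

    # Sample distractors from the other references; more distractors
    # is one (assumed) way to make the item harder.
    rng = random.Random(seed)
    pool = sorted(key for key in references if key != cited_key)
    distractors = rng.sample(pool, min(num_distractors, len(pool)))

    option_keys = distractors + [cited_key]
    rng.shuffle(option_keys)

    return {
        "question": masked,
        "options": [references[key] for key in option_keys],
        "answer": option_keys.index(cited_key),
    }


# Toy data, invented for this example only.
references = {
    "a": "Reference title A",
    "b": "Reference title B",
    "c": "Reference title C",
    "d": "Reference title D",
}
passage = "Long-context evaluation builds on earlier retrieval tests [b] and later work [c]."
item = make_cloze_item(passage, "b", references, num_distractors=3)
```

A model's prediction can then be scored by comparing its chosen option index with `item["answer"]`. Because the items are built from the papers themselves, the set can be regenerated whenever new papers appear.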

Code Implementations: 1 repo
