PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
This work addresses the need among AI and NLP researchers for better benchmarks of global comprehension and reasoning over long contexts. Its contribution is incremental in that it builds on existing evaluation frameworks.
The authors tackle the problem of evaluating long-context understanding by introducing PRELUDE, a benchmark that requires judging whether a prequel story is consistent with the original narrative, and find that state-of-the-art models lag behind humans by over 15% in accuracy.
We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks: because the prequels are not part of the original story, assessing their plausibility typically requires searching for and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, lag behind humans by more than 15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to a gap of over 30% in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
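To make the distinction between answer accuracy and reasoning accuracy concrete, the following is a minimal sketch of how the two metrics could be computed. It is not the authors' evaluation code: the `PrequelInstance` fields, the label strings, the assumption that a reasoning-correct answer must also carry the correct label, and the toy data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PrequelInstance:
    # All field names and label strings are hypothetical, not the benchmark's schema.
    character: str
    prequel: str
    label: str  # gold label: "consistent" or "contradict"


def accuracy(predictions: Sequence[str], instances: Sequence[PrequelInstance]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    if not instances or len(predictions) != len(instances):
        raise ValueError("predictions and instances must be non-empty and equal in length")
    correct = sum(p == inst.label for p, inst in zip(predictions, instances))
    return correct / len(instances)


def reasoning_accuracy(
    predictions: Sequence[str],
    reasoning_ok: Sequence[bool],
    instances: Sequence[PrequelInstance],
) -> float:
    """Fraction of instances with a correct label AND reasoning judged valid.

    Assumption: `reasoning_ok` comes from human annotators who reviewed
    each model explanation; an answer counts only if both checks pass.
    """
    if not instances or not (len(predictions) == len(reasoning_ok) == len(instances)):
        raise ValueError("all inputs must be non-empty and equal in length")
    correct = sum(
        p == inst.label and ok
        for p, ok, inst in zip(predictions, reasoning_ok, instances)
    )
    return correct / len(instances)


# Toy data, invented for illustration only.
instances = [
    PrequelInstance("A", "prequel text 1", "consistent"),
    PrequelInstance("B", "prequel text 2", "contradict"),
    PrequelInstance("C", "prequel text 3", "consistent"),
    PrequelInstance("D", "prequel text 4", "contradict"),
]
predictions = ["consistent", "contradict", "contradict", "contradict"]
reasoning_ok = [True, False, True, True]

answer_acc = accuracy(predictions, instances)
reason_acc = reasoning_accuracy(predictions, reasoning_ok, instances)
```

In this toy example the second prediction has the right label but flawed reasoning, so it counts toward answer accuracy and not toward reasoning accuracy, which is the kind of case the human study in the abstract describes.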