CL · Feb 7, 2025

NoLiMa: Long-Context Evaluation Beyond Literal Matching

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze

arXiv:2502.05167v334.496 citations · h-index: 15 · Has Code · ICML

Originality: Incremental advance

AI Analysis

This paper addresses the overestimation of LLM capabilities in long-context scenarios, a concern for researchers and developers. The advance is incremental because it builds on existing needle-in-a-haystack benchmarks.

The paper argues that existing long-context evaluation benchmarks for LLMs rely on literal matching, which simplifies the task. It introduces NoLiMa, a benchmark that requires inferring latent associations when the question and the needle have minimal lexical overlap. Performance degrades significantly for the 13 evaluated LLMs as context length increases: at 32K tokens, 11 models drop below 50% of their short-length baselines, and even GPT-4o, one of the top performers, falls from 99.3% to 69.7%.
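To make the "below 50% of baseline" comparison concrete, the following is a minimal sketch of one plausible reading of it: long-context accuracy divided by short-context accuracy. The function name `fraction_of_baseline` is hypothetical and is not taken from the NoLiMa codebase; only the two GPT-4o figures come from the summary above.

```python
def fraction_of_baseline(baseline_acc: float, long_context_acc: float) -> float:
    """Return long-context accuracy as a fraction of the short-context baseline.

    Hypothetical helper for illustration; not part of the NoLiMa release.
    """
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return long_context_acc / baseline_acc


# GPT-4o figures quoted above: 99.3% short-context baseline, 69.7% at 32K tokens.
gpt4o_fraction = fraction_of_baseline(99.3, 69.7)

# Under this reading, a model "drops below 50% of baseline" when the ratio is < 0.5.
gpt4o_below_half = gpt4o_fraction < 0.5
```

Under this reading GPT-4o retains roughly 70% of its baseline at 32K tokens, which is why it counts as one of the exceptions to the 50% drop.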

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
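The setup described in the abstract can be sketched in a few lines: place a needle at a chosen depth in a haystack, then check how many content words the question shares with the needle. This is an illustrative sketch only. The helper names, the tiny stopword list, and the example sentences are assumptions made here and are not drawn from the NoLiMa dataset or evaluation code.

```python
import re

# Small illustrative stopword list (an assumption, not the benchmark's).
STOPWORDS = {"which", "who", "the", "a", "an", "has", "have", "been", "to", "next"}


def content_words(text, stopwords=STOPWORDS):
    """Lowercase the text, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in stopwords}


def lexical_overlap(question, needle):
    """Fraction of the question's content words that also appear in the needle."""
    q_words = content_words(question)
    if not q_words:
        return 0.0
    return len(q_words & content_words(needle)) / len(q_words)


def insert_needle(haystack_sentences, needle, depth):
    """Insert the needle at a relative depth in [0, 1] of the haystack."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    pos = round(depth * len(haystack_sentences))
    return haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]


question = "Which character has visited Dresden?"

# Literal-match needle: shares surface words with the question.
literal_needle = "Mara has visited Dresden."
# Latent-association needle: answering requires knowing the Semper Opera
# House is in Dresden; no content word is shared with the question.
latent_needle = "Mara lives next to the Semper Opera House."

literal_score = lexical_overlap(question, literal_needle)
latent_score = lexical_overlap(question, latent_needle)

haystack = ["Filler sentence %d." % i for i in range(4)]
context = insert_needle(haystack, latent_needle, depth=0.5)
```

In the literal case a model can locate the needle by matching "visited" and "Dresden" directly. In the latent case the overlap score is zero, so the association has to be inferred, which is the condition the benchmark is built around.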
