CL, AI, Jun 14, 2024

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

arXiv:2406.10149v2, 218 citations
Originality: Incremental advance
AI Analysis

The paper addresses a gap in evaluation methods for long-context reasoning in LLMs, which matters to researchers and developers working on model efficiency and scalability. The advance is incremental because it builds on existing benchmarking approaches.

The researchers address the problem of evaluating how well large language models reason across facts in extremely long contexts. They introduce the BABILong benchmark, which includes 20 diverse reasoning tasks, and find that popular LLMs effectively use only 10-20% of the context, with performance declining sharply as reasoning complexity increases. Retrieval-augmented generation methods achieve 60% accuracy on single-fact questions, and recurrent memory transformers, after fine-tuning, can process up to 50 million tokens.

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers after fine-tuning, enabling the processing of lengths up to 50 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 10 million token lengths.
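
The idea of scattering task facts across long natural text can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_haystack_sample`, the example facts, and the filler sentences are all hypothetical, and the length is counted in sentences here rather than tokens for simplicity.

```python
import random


def build_haystack_sample(facts, question, filler_sentences, total_sentences, seed=0):
    """Scatter task facts among filler sentences, keeping the facts in order.

    Hypothetical helper for illustration only; not part of the BABILong code.
    """
    if total_sentences < len(facts):
        raise ValueError("total_sentences must be at least len(facts)")
    if not filler_sentences:
        raise ValueError("filler_sentences must not be empty")

    rng = random.Random(seed)
    # Sorted positions preserve the original relative order of the facts,
    # which matters for tasks where later facts override earlier ones.
    fact_positions = sorted(rng.sample(range(total_sentences), len(facts)))
    position_to_fact = dict(zip(fact_positions, facts))

    sentences = []
    for i in range(total_sentences):
        if i in position_to_fact:
            sentences.append(position_to_fact[i])
        else:
            # Cycle through the background text as irrelevant "hay".
            sentences.append(filler_sentences[i % len(filler_sentences)])

    return {"context": " ".join(sentences), "question": question}


# Assumed toy data in the style of simple fact-chaining questions.
facts = [
    "Mary went to the kitchen.",
    "Mary picked up the apple.",
    "Mary moved to the garden.",
]
filler = [
    "The weather had been mild all week.",
    "A train passed in the distance.",
    "Nobody noticed the old clock.",
]
sample = build_haystack_sample(facts, "Where is the apple?", filler, total_sentences=50)
```

Increasing `total_sentences` lengthens the context while the facts and the question stay the same, so in this sketch any change in a model's accuracy can be attributed to context length rather than to the reasoning task itself.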

Code Implementations: 4 repos
