CL · Nov 11, 2024

On Many-Shot In-Context Learning for Long-Context Evaluation

arXiv:2411.07130v3 · 8 citations · h-index: 6 · ACL
Originality: Incremental advance
AI Analysis

This work evaluates long-context language models through many-shot ICL, offering researchers and practitioners task-specific performance insights and a new benchmark. It is incremental in that it builds on existing many-shot ICL methods.

The paper investigates which many-shot in-context learning tasks benefit from additional demonstrations and how well those tasks evaluate long-context language models. It finds that classification and summarization improve with more demonstrations, while translation and reasoning show no clear trend. It also introduces a benchmark showing that state-of-the-art models perform well up to 64k tokens on retrieval-based tasks, whereas many models drop significantly at only 16k tokens on tasks requiring global understanding.

Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs? We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that necessitate a deeper comprehension of all examples in the prompt. Lastly, we introduce a new many-shot ICL benchmark, MANYICLBENCH, to characterize a model's ability on both fronts, and we benchmark 12 LCLMs using it. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
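
The SSL/ASL distinction can be made concrete with a small sketch. The Python below is an illustration only, not the paper's metric or code: it builds one many-shot prompt from all demonstrations and another from just the k demonstrations most similar to the query. The bag-of-words cosine similarity, the `Input:`/`Label:` prompt template, the toy data, and every function name are assumptions introduced here. Intuitively, if a model scores about the same on both prompts the task behaves like SSL, and if it needs the full prompt the task behaves more like ASL.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (a stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(demos, query, k):
    """Return the k (input, label) pairs whose inputs best match the query."""
    q = bow(query)
    ranked = sorted(demos, key=lambda d: cosine(bow(d[0]), q), reverse=True)
    return ranked[:k]


def build_prompt(demos, query):
    """Concatenate demonstrations and the query into one ICL prompt."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


# Toy demonstrations (hypothetical data, for illustration only).
demos = [
    ("the movie was wonderful", "positive"),
    ("the food was terrible", "negative"),
    ("a wonderful and moving film", "positive"),
    ("the service was slow and rude", "negative"),
]
query = "what a wonderful movie"

# All-sample prompt: every demonstration is placed in context.
full_prompt = build_prompt(demos, query)

# Similar-sample prompt: only the 2 nearest demonstrations are kept.
retrieved = top_k_similar(demos, query, k=2)
retrieved_prompt = build_prompt(retrieved, query)
```

Sending `full_prompt` and `retrieved_prompt` to the same model and comparing accuracy over many queries would give a rough signal of how much a task depends on retrieval alone; the metrics the paper actually develops may differ.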

Code Implementations: 1 repo