Language Models (Mostly) Know When to Stop Reading
This work addresses the computational cost that LLMs incur on long contexts. Rather than an incremental improvement to existing compression methods, it offers a new efficiency paradigm in which the model itself signals when it has read enough.
The paper tackles the inefficiency of large language models processing their entire input context by introducing dynamic context cutoff, a method that lets a model stop processing once it has acquired sufficient task-relevant information. Across six QA datasets, this yields a 3.4% accuracy improvement and a 1.33x token reduction on average.
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a novel method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals" -- detectable through lightweight classifiers -- that predict when critical information has been processed. This reveals a new efficiency paradigm: a model's internal understanding, rather than external compression heuristics, dictates how much it needs to process. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate a 3.4% accuracy improvement while achieving a 1.33x token reduction on average. Furthermore, our method outperforms other context efficiency methods at equivalent token reduction rates. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
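
The control flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `dynamic_context_cutoff` and `toy_score`, the chunk granularity, and the 0.5 threshold are all assumptions made for illustration. In particular, `toy_score` is a keyword check standing in for the lightweight classifier over attention-head activations, which the abstract does not specify in detail.

```python
from typing import Callable, List, Sequence, Tuple


def dynamic_context_cutoff(
    chunks: Sequence[str],
    sufficiency_score: Callable[[List[str]], float],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[List[str], int]:
    """Read chunks in order and stop once the score reaches the threshold.

    Returns the chunks that were read and how many of them there are.
    If the score never reaches the threshold, the whole context is read.
    """
    seen: List[str] = []
    for chunk in chunks:
        seen.append(chunk)
        if sufficiency_score(seen) >= threshold:
            break
    return seen, len(seen)


def toy_score(seen: List[str]) -> float:
    # Hypothetical stand-in for the probe: returns 1.0 once the
    # answer-bearing text has appeared in the chunks read so far.
    return 1.0 if any("Paris" in c for c in seen) else 0.0


chunks = [
    "France is a country in Western Europe.",
    "Its capital is Paris.",
    "The country is known for its cuisine.",
    "It has many regions.",
]

kept, n_read = dynamic_context_cutoff(chunks, toy_score)
reduction = len(chunks) / n_read  # chunk-level reduction factor
print(n_read, reduction)
```

On this toy input the loop stops after the second chunk, so only half of the context is read. For comparison, the reported average 1.33x token reduction corresponds to processing roughly 75% of the tokens (1 / 1.33 ≈ 0.75).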