Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
This paper addresses the challenge of data curation for long-context language models, which is crucial for unlocking advanced capabilities in reasoning and summarization. The contribution is incremental, however, since it builds on existing pretraining methods.
The paper tackles inefficient long-context LLM pretraining caused by data that lacks meaningful long-range dependencies. It introduces LongFilter, a framework that selects high-quality data based on the information gain from extended context. When extending LLaMA-3-8B's context length from 8K to 64K, training on the selected data yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
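
The contrast between long-context and short-context predictions can be illustrated with a small sketch. The abstract does not give LongFilter's exact scoring formula, so everything here is an assumption made for illustration: the score is taken to be the mean per-token log-probability improvement when the model sees the extended context, and the names long_context_gain and select_top_fraction are hypothetical. The per-token log-probabilities are assumed to have been computed beforehand by scoring the same target tokens twice, once with the full context and once with a truncated one.

```python
import math
from typing import List, Sequence


def long_context_gain(logp_long: Sequence[float], logp_short: Sequence[float]) -> float:
    """Mean per-token log-probability improvement from extended context.

    Hypothetical score, not the paper's formula. Both inputs hold the
    log-probabilities of the same target tokens, scored with a long
    context (logp_long) and with a short context (logp_short).
    """
    if len(logp_long) != len(logp_short):
        raise ValueError("both sequences must score the same tokens")
    if not logp_long:
        return 0.0
    total = sum(l - s for l, s in zip(logp_long, logp_short))
    return total / len(logp_long)


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return the indices of the highest-scoring samples, in original order."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    keep = math.ceil(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy data: three documents with made-up per-token log-probabilities.
docs_long = [[-1.0, -0.5, -0.2], [-1.0, -1.0], [-0.3, -0.3]]
docs_short = [[-1.0, -2.5, -2.2], [-1.0, -1.1], [-0.9, -0.7]]

scores = [long_context_gain(l, s) for l, s in zip(docs_long, docs_short)]
selected = select_top_fraction(scores, fraction=0.5)
```

In this toy data the second document is predicted almost equally well with or without the extended context, so it receives a score near zero and is the one left out. That matches the abstract's observation that text predictable from local context alone contributes little to long-context training.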