CL · AI · Jan 22, 2025

NExtLong: Toward Effective Long-Context Training without Long Documents

arXiv:2501.12766v2 · 17 citations · h-index: 10 · ICML
Originality: Highly original
AI Analysis

This work addresses the scarcity of long-context training data, a problem facing AI researchers and developers, and offers an incremental improvement over prior synthesis approaches.

The paper tackles the challenge of training large language models with extended context windows when long documents are scarce. It proposes NExtLong, a framework that synthesizes long-context data through negative document extension with hard distractors, and reports significant performance improvements on the HELMET and RULER benchmarks over existing methods.

Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models, which are trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
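The chunk-and-interleave idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the word-based chunking, the `chunk_words` and `k` parameters, and the token-overlap (Jaccard) scoring are all assumptions made for illustration. A real pipeline would presumably retrieve distractors with a proper retriever over a large pretraining corpus and work at the token level.

```python
from typing import List


def split_into_meta_chunks(document: str, chunk_words: int) -> List[str]:
    """Split a document into consecutive chunks of at most `chunk_words` words."""
    words = document.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for a real retriever score)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def retrieve_hard_negatives(chunk: str, corpus: List[str], k: int) -> List[str]:
    """Return the k corpus passages most similar to `chunk`.

    Similar-but-unrelated passages act as hard distractors. Passages
    identical to the chunk itself are skipped.
    """
    candidates = [passage for passage in corpus if passage != chunk]
    ranked = sorted(candidates, key=lambda p: jaccard(chunk, p), reverse=True)
    return ranked[:k]


def extend_with_negatives(
    document: str, corpus: List[str], chunk_words: int = 4, k: int = 1
) -> List[str]:
    """Interleave each meta-chunk with its k retrieved distractors.

    The original chunks keep their order, so the dependencies between
    them now span the inserted distractor text.
    """
    segments: List[str] = []
    for chunk in split_into_meta_chunks(document, chunk_words):
        segments.append(chunk)
        segments.extend(retrieve_hard_negatives(chunk, corpus, k))
    return segments


if __name__ == "__main__":
    doc = "the cat sat on the mat and then the cat slept"
    corpus = [
        "a cat sat on a chair",
        "stock prices fell sharply today",
        "the mat and the rug",
    ]
    for segment in extend_with_negatives(doc, corpus, chunk_words=4, k=1):
        print(segment)
```

In this toy run the unrelated finance sentence is never selected, because it shares no tokens with any meta-chunk. Only the lexically similar passages are inserted, which is the sense in which the distractors are "hard".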

Code Implementations: 1 repo
