CL · IR · Dec 25, 2024

Bootstrap Your Own Context Length

Microsoft
arXiv:2412.18860v2 · 6 citations · h-index: 22 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of efficiently scaling the context length of language models. The contribution is rated incremental because it builds on the models' existing short-context capabilities.

The paper tackles the training of long-context language models by introducing a bootstrapping method that synthesizes long-context instruction-tuning data using only short-context models, eliminating manual data collection. Experiments with Llama-3 models show that the method extends context length to as much as 1M tokens with superior performance across various benchmarks.

We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only. Our method utilizes a simple agent workflow to synthesize diverse long-context instruction tuning data, thereby eliminating the necessity for manual data collection and annotation. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection, all of which are readily accessible within the open-source ecosystem. Subsequently, language models are fine-tuned using the synthesized data to extend their context lengths. In this manner, we effectively transfer the short-context capabilities of language models to long-context scenarios through a bootstrapping process. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens, achieving superior performance across various benchmarks.
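The abstract names three ingredients (a short-context language model, a text retriever, and a document collection) but does not spell out the workflow steps. The sketch below is one possible way such a pipeline could be wired together, not the authors' implementation. Every name in it (`overlap_score`, `retrieve`, `synthesize`, `Example`, the prompt strings) is hypothetical, the word-overlap retriever is a stand-in for a real text retriever, and `short_lm` is any callable that maps a short prompt to text.

```python
from dataclasses import dataclass
from typing import Callable, List


def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: count of distinct query words found in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    """Return the k highest-scoring documents (a stand-in for a real retriever)."""
    ranked = sorted(corpus, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:k]


@dataclass
class Example:
    context: str      # long input assembled from several short documents
    instruction: str  # task written by the short-context model
    response: str     # answer written by the short-context model


def synthesize(seed_doc: str, corpus: List[str],
               short_lm: Callable[[str], str], k: int = 3) -> Example:
    """Build one long-context training example using only short prompts.

    Assumed steps (not taken from the paper):
    1. the model writes an instruction from a single seed document;
    2. the retriever gathers k related documents;
    3. the model takes notes on each document separately, so no single
       prompt exceeds its short context window;
    4. the model answers from the notes;
    5. the full documents are concatenated as the long context.
    """
    instruction = short_lm(f"Write a question about:\n{seed_doc}")
    docs = retrieve(instruction, corpus, k)
    notes = [short_lm(f"Note facts relevant to '{instruction}' in:\n{d}") for d in docs]
    response = short_lm(f"Answer '{instruction}' using these notes:\n" + "\n".join(notes))
    return Example(context="\n\n".join(docs), instruction=instruction, response=response)
```

In this sketch the model never sees more than one document at a time, while the resulting `Example.context` grows with `k`. That is the sense in which short-context capability could be bootstrapped into long-context training data under these assumptions.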

