CL Apr 15

Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

Zichong Li, Chen Liang, Liliang Ren, Tuo Zhao, Yelong Shen, Weizhu Chen

arXiv:2604.14339 62.3 h-index: 7

AI Analysis

For practitioners adapting short-context LLMs to long-context tasks, this method provides a simple regularizer that enhances robustness and extrapolation without requiring new data or architectures.

The paper finds that standard long-context adaptation of LLMs leaves high positional variance: accuracy depends on the absolute placement of evidence. The authors propose RoPE-Perturbed Self-Distillation, which improves positional robustness by training the model to produce consistent predictions across perturbed context positions, achieving up to 12.04% improvement on RULER-64K for Llama-3-8B.

Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
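The core idea in the abstract (perturb the RoPE indices to form a second view, then penalize disagreement between views) can be illustrated with a minimal pure-Python sketch. The abstract does not specify the perturbation scheme, the direction of distillation, or the loss, so everything here is an assumption made for illustration: the hypothetical `perturbed_position_ids` inserts random gaps between fixed-size chunks while preserving token order, and the hypothetical `consistency_loss` is a mean per-token KL divergence that treats the unperturbed view as the teacher.

```python
import math
import random


def perturbed_position_ids(seq_len, chunk_size, max_position, rng):
    """Return one position index per token, with random gaps between chunks.

    Assumed scheme (not taken from the paper): tokens keep their order and
    stay contiguous inside a chunk, but each chunk is shifted right by a
    random offset, so evidence lands at a different absolute position.
    """
    slack = max_position - seq_len
    if slack < 0:
        raise ValueError("sequence does not fit in the position range")
    n_chunks = math.ceil(seq_len / chunk_size)
    # Sorted offsets are non-decreasing, so positions remain strictly increasing.
    offsets = sorted(rng.randint(0, slack) for _ in range(n_chunks))
    return [i + offsets[i // chunk_size] for i in range(seq_len)]


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def consistency_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) between two views.

    Assumption: the teacher view is treated as a fixed target; in a real
    training loop its logits would be computed without gradients.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row)
        log_q = log_softmax(s_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


if __name__ == "__main__":
    rng = random.Random(0)
    original = list(range(8))                      # standard positions 0..7
    perturbed = perturbed_position_ids(8, 4, 32, rng)
    print("original :", original)
    print("perturbed:", perturbed)
    # Toy logits standing in for model outputs under the two views.
    teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    student = [[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]]
    print("consistency loss:", consistency_loss(teacher, student))
```

In an actual training setup, the two position lists would be passed to the same model as its RoPE position indices, and the consistency term would presumably be added to the usual fine-tuning loss with some weight; how the paper combines and weights the terms is not stated in the abstract.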
