CL | AI | Oct 12, 2025

UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models

arXiv:2510.10481v1 | 11 citations | h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the long-context limitation of diffusion LLMs. It is an incremental advance, most useful to practitioners who need an efficient post-training method for context extension.

The authors extend the context length of a diffusion large language model (LLaDA) through post-training rather than retraining from scratch. The resulting model has a 128K-token context window and significantly outperforms training-free baselines on long-context tasks.

Diffusion LLMs have attracted growing interest, with plenty of recent work emphasizing their great potential in various downstream tasks; yet the long-context behavior of diffusion LLMs remains largely uncharted. We present a case study of post-training techniques for extending the context window of diffusion LLMs (i.e., LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post-training and analyze their impact on optimization stability and long-range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K-token context window that, in our empirical evaluation on long-context tasks, significantly outperforms training-free baselines. Our experimental results highlight the special positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K-scale context via efficient post-training.
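For readers unfamiliar with RoPE extension, the sketch below shows one widely used baseline, NTK-aware base scaling, which enlarges the rotary base so that the lowest rotation frequency is stretched by the context scale factor. This is not the modified extension proposed in the paper, whose details the abstract does not give. The head dimension, original context length, and base value are illustrative assumptions.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware base adjustment: base * scale^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

# Illustrative values only (assumptions, not taken from the paper).
HEAD_DIM = 128                      # per-head dimension
ORIGINAL_LEN = 4096                 # assumed pre-extension context length
TARGET_LEN = 128 * 1024             # 128K-token target window
SCALE = TARGET_LEN / ORIGINAL_LEN   # context scale factor (32 here)

original = rope_inv_freq(HEAD_DIM)
extended = rope_inv_freq(HEAD_DIM, ntk_scaled_base(10000.0, SCALE, HEAD_DIM))

# The highest frequency is untouched; the lowest is divided by SCALE.
print(extended[0], original[-1] / extended[-1])
```

With this choice of exponent, the fastest-rotating dimension keeps its original frequency while the slowest one is slowed by exactly the scale factor, so nearby positions stay well resolved while distant ones remain within the rotation range seen during pretraining.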

