CL | AI | Sep 29, 2025

LLaDA-MoE: A Sparse MoE Diffusion Language Model

arXiv:2509.24389v1 | 32 citations | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses computational efficiency for researchers and practitioners using diffusion language models, though it is incremental as it builds on existing MoE and diffusion techniques.

The authors address the high computational cost of large diffusion language models by introducing LLaDA-MoE, a sparse Mixture-of-Experts model trained on approximately 20T tokens. With 7B total parameters but only 1.4B active during inference, it achieves state-of-the-art performance among diffusion language models, and its instruct-tuned variant performs comparably to Qwen2.5-3B-Instruct across knowledge, code, math, agent, and alignment tasks.
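To make the gap between total and active parameters concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not the authors' implementation: the function names, the scalar "experts", and the parameter split (a shared block plus equally sized experts, with 64 experts and 8 selected per token) are hypothetical values chosen only so that the totals reproduce the 7B and 1.4B figures quoted above.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


def total_params(shared, per_expert, num_experts):
    """Parameters stored by the model (all experts counted)."""
    return shared + num_experts * per_expert


def active_params(shared, per_expert, k):
    """Parameters actually used for one token (only k experts counted)."""
    return shared + k * per_expert


# Toy experts: expert i simply scales its input by i (illustration only).
experts = [lambda x, s=i: s * x for i in range(4)]
output = moe_forward(2.0, experts, router_logits=[0.0, 1.0, 3.0, 2.0], k=2)

# Hypothetical split, in millions of parameters (NOT taken from the paper).
print(total_params(shared=600, per_expert=100, num_experts=64))  # stored
print(active_params(shared=600, per_expert=100, k=8))            # used per token
```

Under these assumed numbers, every token touches the shared block plus 8 of 64 experts, so per-token compute scales with the active count while capacity scales with the total.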

We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models, including those with more parameters, surpassing the previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent, and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available on Hugging Face.
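The abstract refers to the training objective of masked diffusion language models without spelling it out. The sketch below assumes the objective commonly used by LLaDA-style models: sample a masking ratio t, mask each token independently with probability t, and sum the negative log-likelihood of the original tokens over masked positions, weighted by 1/t. The `MASK` token, the function names, and the placeholder log-probabilities are hypothetical, and normalization by sequence length is omitted for brevity.

```python
import random

MASK = "[MASK]"  # hypothetical mask token for illustration


def forward_mask(tokens, t, rng):
    """Forward (noising) step: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(token_log_probs, noisy, t):
    """Negative log-likelihood over masked positions only, weighted by 1/t.

    token_log_probs[i] stands in for the log-probability a model assigns to
    the original token at position i given the noisy sequence (assumed input).
    """
    masked = [lp for lp, tok in zip(token_log_probs, noisy) if tok == MASK]
    return -sum(masked) / t


rng = random.Random(0)
tokens = ["diffusion", "models", "denoise", "masked", "text"]
t = 0.5  # in training, t would be sampled uniformly from (0, 1]
noisy = forward_mask(tokens, t, rng)

# Placeholder log-probs; a real model would compute these from `noisy`.
log_probs = [-0.1, -0.2, -0.3, -0.4, -0.5]
loss = masked_diffusion_loss(log_probs, noisy, t)
print(noisy, loss)
```

Only masked positions contribute to the loss, so unmasked tokens act purely as context. In an MoE variant, the same objective would apply with the dense feed-forward layers replaced by routed experts.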
