CL · AI · May 24

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

arXiv:2606.00091 · 20.3 · h-index: 2

Predicted impact: top 73% in CL · last 90 days · Originality: Highly original

AI Analysis

For researchers in self-supervised learning and language model fine-tuning, DLLM-JEPA offers a more efficient and effective alternative to existing JEPA-based methods, with demonstrated improvements across diverse tasks and architectures.

DLLM-JEPA pairs Joint Embedding Predictive Architectures with masked-diffusion language models, eliminating the need for explicit multi-view data and reducing training FLOPs by 33% compared to LLM-JEPA. It achieves up to +18.7 pp accuracy improvement on GSM8K and consistent gains across multiple benchmarks while preserving base MMLU accuracy.

Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step. We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates -- no explicit pairs needed -- and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds -- whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain. Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.
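
The two ideas in the abstract, views built from different masking rates and the 33% FLOPs reduction, can be illustrated with a small sketch. It is not the authors' implementation. The mask token, the two masking rates, and all function names are hypothetical. The cost model assumes a backward pass costs about twice a forward pass, and that the second view is encoded by a forward pass without gradients. The abstract states neither assumption, but together they reproduce the reported figure.

```python
import random

# Hypothetical mask token; the actual tokenizer's mask id is not given in the source.
MASK = "[MASK]"


def masked_view(tokens, rate, rng):
    """Replace each token independently with MASK with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def two_views(tokens, rate_a=0.6, rate_b=0.15, seed=0):
    """Build two views of the SAME sequence by masking at two different rates.

    The rates are illustrative placeholders, not values from the paper.
    """
    rng = random.Random(seed)
    view_a = masked_view(tokens, rate_a, rng)  # more heavily masked view
    view_b = masked_view(tokens, rate_b, rng)  # more lightly masked view
    return view_a, view_b


def relative_training_cost(grad_passes, nograd_passes, backward_to_forward=2.0):
    """Per-step cost in units of one forward pass.

    Assumption: a gradient-carrying pass costs forward + backward, with
    backward ~ 2x forward; a no-gradient pass costs one forward only.
    """
    return grad_passes * (1.0 + backward_to_forward) + nograd_passes


tokens = "def add ( a , b ) : return a + b".split()
view_a, view_b = two_views(tokens)

# Two gradient-carrying passes (as described for LLM-JEPA) versus one
# gradient-carrying pass plus one assumed no-gradient pass.
two_grad = relative_training_cost(grad_passes=2, nograd_passes=0)
one_grad = relative_training_cost(grad_passes=1, nograd_passes=1)
saving = 1.0 - one_grad / two_grad

print(view_a)
print(view_b)
print(f"relative FLOPs saving under these assumptions: {saving:.0%}")
```

Under this cost model the two-pass scheme costs 6 forward-equivalents per step and the one-pass scheme costs 4, a one-third saving. If the paper counts FLOPs differently, the arithmetic here would not apply.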
