CL · Mar 8

Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

arXiv:2603.07475v1 · 4 citations
Predicted impact: top 25% in CL · last 90 days
Originality: Highly original
AI Analysis

This work provides insights into the representational differences between dLLMs and AR models, offering a path to improve inference efficiency for dLLMs without architectural changes or KV-cache sharing.

This paper investigates the internal representations of diffusion language models (dLLMs) compared to autoregressive (AR) models, finding that dLLMs exhibit more hierarchical and redundant representations. Leveraging this, the authors introduce a layer-skipping method for dLLMs, achieving up to 18.75% FLOPs reduction while maintaining over 90% performance on reasoning and code generation benchmarks.

Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
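To make the static layer-skipping idea concrete, here is a minimal sketch, not the authors' implementation. It treats a model as an ordered list of layer functions and bypasses a fixed set of layer indices at inference time, so the input to a skipped layer is passed on unchanged. The names `forward_with_skips` and `flops_reduction`, the toy layers, and the skipped indices are all hypothetical. The arithmetic at the end assumes that compute scales linearly with the number of transformer blocks (ignoring embeddings and the output head) and assumes a 32-block model; under those assumptions, skipping 6 blocks would correspond to the reported 18.75% figure.

```python
from typing import Callable, Iterable, Sequence

# A "layer" here is any function from a hidden state to a hidden state.
# Real transformer blocks operate on tensors; floats keep the sketch runnable.
Layer = Callable[[float], float]


def forward_with_skips(layers: Sequence[Layer], x: float,
                       skip: Iterable[int] = ()) -> float:
    """Apply layers in order, bypassing the indices listed in `skip`.

    The skip set is fixed before inference (static) and does not depend
    on the input or the task (task-agnostic).
    """
    skip_set = frozenset(skip)
    for i, layer in enumerate(layers):
        if i in skip_set:
            continue  # hidden state passes through unchanged
        x = layer(x)
    return x


def flops_reduction(n_layers: int, n_skipped: int) -> float:
    """Fraction of per-layer compute saved, assuming equal-cost layers."""
    return n_skipped / n_layers


# Toy model: layer k adds k to the hidden state (hypothetical, for illustration).
toy_layers = [lambda h, k=k: h + k for k in range(8)]

full_output = forward_with_skips(toy_layers, 0)                  # all 8 layers
skipped_output = forward_with_skips(toy_layers, 0, skip={1, 2})  # bypass 2 layers

# Assumed configuration: 32 equal-cost blocks, 6 skipped.
assumed_saving = flops_reduction(32, 6)  # 0.1875, i.e. 18.75%
```

Because the skip set is chosen once and the remaining layers run unmodified, a scheme of this shape needs no architectural changes and is independent of any KV-cache strategy, which is consistent with how the abstract describes the method.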

