CV | Jun 23, 2025

Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation

arXiv:2506.18999v1 | 3 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses efficiency challenges in high-resolution image generation for AI and computer vision applications, though it is incremental as it builds on existing transformer and Mamba models.

The paper tackles the high computational cost of self-attention in diffusion transformers for high-resolution image generation. It introduces a distillation method that moves a trained transformer toward the more efficient Mamba model, and reports high-quality text-to-image generation at resolutions up to 2048x2048 with low overhead.
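To make the cost argument concrete, here is a rough scaling sketch. The `downsample` and `patch` values are illustrative assumptions, not figures from the paper; the ratios between resolutions do not depend on them. It compares how a pairwise (quadratic) attention cost and a linear-scan cost grow when moving from 512x512 to 2048x2048.

```python
def token_count(resolution, downsample=8, patch=2):
    """Tokens for a square image (downsample and patch are assumed values)."""
    side = resolution // (downsample * patch)
    return side * side

base_tokens = token_count(512)    # tokens at the base resolution
high_tokens = token_count(2048)   # tokens at the target resolution

# Self-attention compares every token with every other token: cost ~ N^2.
attention_ratio = (high_tokens ** 2) // (base_tokens ** 2)

# A linear-complexity sequence model touches each token once: cost ~ N.
linear_ratio = high_tokens // base_tokens

print(f"tokens: {base_tokens} -> {high_tokens}")
print(f"quadratic cost grows {attention_ratio}x, linear cost grows {linear_ratio}x")
```

Under these assumptions, the token count grows 16-fold while the pairwise attention cost grows 256-fold, which is the gap a linear-complexity model is meant to close.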

The quadratic computational complexity of self-attention in diffusion transformers (DiT) introduces substantial computational costs in high-resolution image generation. While the linear-complexity Mamba model emerges as a potential alternative, direct Mamba training remains empirically challenging. To address this issue, this paper introduces diffusion transformer-to-mamba distillation (T2MD), forming an efficient training pipeline that facilitates the transition from the self-attention-based transformer to the linear complexity state-space model Mamba. We establish a diffusion self-attention and Mamba hybrid model that simultaneously achieves efficiency and global dependencies. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD alleviates the training difficulty and high cost of a state space model from scratch. Starting from the distilled 512$\times$512 resolution base model, we push the generation towards 2048$\times$2048 images via lightweight adaptation and high-resolution fine-tuning. Experiments demonstrate that our training path leads to low overhead but high-quality text-to-image generation. Importantly, our results also justify the feasibility of using sequential and causal Mamba models for generating non-causal visual output, suggesting the potential for future exploration.
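The abstract does not spell out the loss, so the following is only a toy sketch of one plausible reading of "layer-level teacher forcing" combined with feature-based distillation: each student layer receives the teacher's input for that layer and is penalised for deviating from the teacher's output feature. All names here (`teacher_forced_feature_loss`, the toy layers) are hypothetical, and the paper's actual formulation may differ.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def teacher_forced_feature_loss(teacher_layers, student_layers, x):
    # Run the teacher once, recording the input and output of every layer.
    hidden = [x]
    for layer in teacher_layers:
        hidden.append(layer(hidden[-1]))
    # Teacher forcing (assumed reading): student layer i is fed the teacher's
    # input to layer i, so student errors do not accumulate across layers.
    losses = [
        mse(student(hidden[i]), hidden[i + 1])
        for i, student in enumerate(student_layers)
    ]
    return sum(losses) / len(losses)

# Toy "layers" acting on plain lists of floats.
teacher = [lambda h: [2.0 * v for v in h], lambda h: [v + 1.0 for v in h]]
imperfect_student = [lambda h: [1.5 * v for v in h], lambda h: [v + 1.0 for v in h]]

x = [1.0, 2.0]
loss_perfect = teacher_forced_feature_loss(teacher, teacher, x)
loss_imperfect = teacher_forced_feature_loss(teacher, imperfect_student, x)
print(loss_perfect, loss_imperfect)
```

In this toy setup, the second student layer incurs no loss even though the first one is wrong, because it is evaluated on the teacher's intermediate feature rather than on the student's own output.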

