CV | AI | LG | May 24, 2024

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

arXiv:2405.15881v1 | 27 citations | h-index: 20
Originality: Incremental advance
AI Analysis

This work targets the computational cost of diffusion-based generative models, which is relevant to both researchers and practitioners. The advance is incremental, since it builds on existing Mamba and diffusion techniques.

The authors address the high computational cost of diffusion transformers for image and video generation by introducing Diffusion Mamba (DiM), a scalable Mamba-based architecture whose cost is linear in sequence length. They report that DiM outperforms existing diffusion transformers on both tasks.

In recent developments, the Mamba architecture, known for its selective state space approach, has shown potential in the efficient modeling of long sequences. However, its application in image generation remains underexplored. Traditional diffusion transformers (DiT), which utilize self-attention blocks, are effective but their computational complexity scales quadratically with the input length, limiting their use for high-resolution images. To address this challenge, we introduce a novel diffusion architecture, Diffusion Mamba (DiM), which foregoes traditional attention mechanisms in favor of a scalable alternative. By harnessing the inherent efficiency of the Mamba architecture, DiM achieves rapid inference times and reduced computational load, maintaining linear complexity with respect to sequence length. Our architecture not only scales effectively but also outperforms existing diffusion transformers in both image and video generation tasks. The results affirm the scalability and efficiency of DiM, establishing a new benchmark for image and video generation techniques. This work advances the field of generative models and paves the way for further applications of scalable architectures.
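The linear-complexity claim can be illustrated with a minimal sketch of a bidirectional state-space scan. This is not the authors' DiM block: it assumes a plain diagonal linear SSM with fixed parameters (Mamba makes them input-dependent), and it assumes the two scan directions are combined by summation, which the abstract does not specify. The names `ssm_scan` and `bidirectional_ssm` are hypothetical. The point is only that each token triggers one fixed-size state update, so the cost grows linearly with sequence length L, whereas self-attention compares every token pair and grows as L squared.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Sequential scan of a diagonal linear SSM (illustrative, non-selective).

    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = sum_n c[:, n] * h_t[:, n]
    x: (L, D) token sequence; a, b, c: (D, N) per-channel parameters.
    Cost is O(L * D * N): one fixed-size state update per token.
    """
    L, D = x.shape
    h = np.zeros_like(a, dtype=float)   # hidden state, shape (D, N)
    y = np.empty((L, D))
    for t in range(L):
        h = a * h + b * x[t][:, None]   # state update for token t
        y[t] = (c * h).sum(axis=1)      # readout back to D channels
    return y

def bidirectional_ssm(x, fwd, bwd):
    """Scan left-to-right and right-to-left, then sum (an assumed merge rule)."""
    y_fwd = ssm_scan(x, *fwd)
    y_bwd = ssm_scan(x[::-1], *bwd)[::-1]   # reverse, scan, restore order
    return y_fwd + y_bwd

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, D, N = 16, 4, 8                      # hypothetical sizes

    def random_params():
        # Decay in (0, 1) keeps the recurrence stable.
        return (rng.uniform(0.5, 0.9, (D, N)),
                rng.standard_normal((D, N)),
                rng.standard_normal((D, N)))

    x = rng.standard_normal((L, D))
    y = bidirectional_ssm(x, random_params(), random_params())
    print(y.shape)  # (16, 4)
```

Doubling L doubles the number of loop iterations in each direction, while the state size (D, N) stays constant. The backward pass gives every position access to tokens on both sides, which matters for images because a flattened patch sequence has no natural causal order.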

