CV AI CL LG Mar 20, 2024

ZigMa: A DiT-style Zigzag Mamba Diffusion Model

Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Schusterbauer, Björn Ommer

arXiv:2403.13802v333.1143 citations h-index: 13 Has Code ECCV

Originality: Incremental advance

AI Analysis

This work addresses efficiency problems in visual data generation for AI researchers, though it is incremental as it builds on existing Mamba and diffusion model frameworks.

The paper tackles scalability and quadratic complexity issues in diffusion models by introducing Zigzag Mamba, a method that improves spatial continuity in Mamba-based vision models, resulting in better performance, speed, and memory utilization compared to baselines on datasets like FacesHQ 1024x1024 and MS COCO 256x256.

The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ $1024\times 1024$ and UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/
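To make the spatial-continuity point concrete, the sketch below compares a plain row-major (raster) scan of a patch grid with a simple back-and-forth zigzag scan. It is an illustration only, not the authors' implementation: the function names (`raster_order`, `zigzag_order`, `max_jump`, `permute_tokens`) are hypothetical, and the single row-wise zigzag shown here is an assumed, simplest-case scan path. It shows why such a reordering adds no parameters: it is just a permutation of the token sequence.

```python
# Illustrative sketch (assumed helper names, not the paper's code):
# compare a raster scan with a back-and-forth zigzag scan on an h x w grid.

def raster_order(h, w):
    """Row-major scan: every row is read left to right."""
    return [(r, c) for r in range(h) for c in range(w)]

def zigzag_order(h, w):
    """Zigzag scan: even rows left to right, odd rows right to left."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order

def max_jump(order):
    """Largest Manhattan distance between consecutive scan positions."""
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )

def permute_tokens(tokens, order, w):
    """Reorder a row-major flat token list to follow the given scan order."""
    return [tokens[r * w + c] for r, c in order]

if __name__ == "__main__":
    h, w = 4, 4
    # The raster scan jumps from the end of one row to the start of the next.
    print("raster max jump:", max_jump(raster_order(h, w)))
    # The zigzag scan only ever moves to an adjacent patch.
    print("zigzag max jump:", max_jump(zigzag_order(h, w)))
    # Reordering is a pure permutation, so it introduces no parameters.
    print(permute_tokens(list(range(6)), zigzag_order(2, 3), 3))
```

In the raster scan, consecutive tokens at a row boundary are far apart in the image, whereas every consecutive pair in the zigzag scan is spatially adjacent; this is the kind of continuity the abstract says most Mamba-based vision methods overlook.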
