CV · Sep 30, 2024

MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation

arXiv:2409.19937v1 · 10 citations · h-index: 40
Originality: Incremental advance
AI Analysis

This work provides a more efficient and higher-quality image generation model for researchers and practitioners in computer vision, offering an incremental improvement over existing architectures.

This paper introduces MaskMamba, a hybrid Mamba-Transformer model for non-autoregressive masked image generation. It addresses scalability and quadratic complexity issues in image generation, achieving superior generation quality compared to Mamba and Transformer models, and a 54.44% inference speed improvement over Transformer at 2048x2048 resolution.

Image generation models have encountered challenges related to scalability and quadratic complexity, primarily due to the reliance on Transformer-based backbones. In this study, we introduce MaskMamba, a novel hybrid model that combines Mamba and Transformer architectures, utilizing Masked Image Modeling for non-autoregressive image synthesis. We meticulously redesign the bidirectional Mamba architecture by implementing two key modifications: (1) replacing causal convolutions with standard convolutions to better capture global context, and (2) utilizing concatenation instead of multiplication, which significantly boosts performance while accelerating inference speed. Additionally, we explore various hybrid schemes of MaskMamba, including both serial and grouped parallel arrangements. Furthermore, we incorporate an in-context condition that allows our model to perform both class-to-image and text-to-image generation tasks. Our MaskMamba outperforms Mamba-based and Transformer-based models in generation quality. Notably, it achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
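The two architectural changes named in the abstract can be illustrated with a toy example. This is a minimal sketch in plain Python, not the authors' implementation: the helper names `conv1d`, `merge_multiply`, and `merge_concat` are hypothetical, and it assumes that "concatenation instead of multiplication" refers to how two branch outputs of the Mamba block are merged, which the abstract does not spell out. The first part shows that a causal convolution lets a token influence only later positions, while a standard (symmetrically padded) convolution spreads it in both directions. The second part contrasts an element-wise product with a concatenation of two feature vectors.

```python
def conv1d(x, kernel, causal):
    """Same-length 1-D convolution over a token sequence with zero padding.

    causal=True  pads only on the left, so output t depends on x[t-k+1..t].
    causal=False pads on both sides, so output t also depends on later tokens.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


def merge_multiply(a, b):
    """Element-wise product of two branches (gating-style merge)."""
    return [u * v for u, v in zip(a, b)]


def merge_concat(a, b):
    """Concatenate two branches along the feature axis (doubles the width)."""
    return list(a) + list(b)


# A single "active" token in the middle of a 5-token sequence.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]

causal_out = conv1d(impulse, kernel, causal=True)
standard_out = conv1d(impulse, kernel, causal=False)
print(causal_out)    # [0.0, 0.0, 1.0, 1.0, 1.0]  (only positions 2..4 see it)
print(standard_out)  # [0.0, 1.0, 1.0, 1.0, 0.0]  (positions 1..3 see it)

# Two hypothetical branch outputs for one token.
branch_a = [2.0, 3.0]
branch_b = [0.5, 1.0]
print(merge_multiply(branch_a, branch_b))  # [1.0, 3.0]
print(merge_concat(branch_a, branch_b))    # [2.0, 3.0, 0.5, 1.0]
```

In masked image generation every token can be predicted from context on both sides, which is why the symmetric receptive field in the second output is the relevant one here. A concatenation also widens the feature vector, so a real block would presumably follow it with a projection back to the model width; that step is omitted from this sketch.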

