CV | Oct 14, 2024

GlobalMamba: Global Image Serialization for Vision Mamba

arXiv:2410.10316v1 | 1 citation | h-index: 22
Originality: Incremental advance
AI Analysis

The paper addresses a limitation of sequential token processing in vision mambas, offering an improvement specific to computer vision tasks.

The paper tackles the problem of vision mambas ignoring 2D structural correlations and global information by proposing GlobalMamba, a method that serializes images globally using frequency domain transformations, resulting in improved performance on tasks like ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.

Vision mambas have demonstrated strong performance with complexity linear in the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten the patches into 1D sequences for causal processing, which ignores the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequentially processing local patches. In this paper, we propose a global image serialization method that transforms the image into a sequence of causal tokens, each of which contains global information about the 2D image. We first convert the image from the spatial domain to the frequency domain using the Discrete Cosine Transform (DCT) and then arrange the pixels by their corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of GlobalMamba on image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.
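
The serialization steps described in the abstract (DCT, grouping by frequency band, inverse transform per band, then tokenization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the band rule (equal-width bands over max(u/H, v/W)), the number of bands, and the patch size are all assumptions made for illustration, and a NumPy-only orthonormal DCT-II matrix is used in place of a library DCT routine.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ X inverts it."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # spatial index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return c


def frequency_band_images(image, num_bands=4):
    """Split a 2D image into per-band spatial images, ordered low to high frequency.

    Assumption (hypothetical band rule): a coefficient (u, v) belongs to the
    band selected by max(u / H, v / W), using equal-width bands on [0, 1).
    """
    h, w = image.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ image @ cw.T  # 2D DCT of the image

    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    radius = np.maximum(u, v)  # normalized frequency, always < 1
    edges = np.linspace(0.0, 1.0, num_bands + 1)

    bands = []
    for b in range(num_bands):
        mask = (radius >= edges[b]) & (radius < edges[b + 1])
        # Keep only this band's coefficients, then invert the 2D DCT.
        bands.append(ch.T @ (coeffs * mask) @ cw)
    return bands


def patchify(image, patch):
    """Flatten non-overlapping patch x patch blocks into row vectors."""
    h, w = image.shape
    blocks = image.reshape(h // patch, patch, w // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def serialize(image, num_bands=4, patch=4):
    """Concatenate the patch tokens of each band image, low frequency first."""
    bands = frequency_band_images(image, num_bands)
    return np.concatenate([patchify(band, patch) for band in bands], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.standard_normal((16, 16))
    tokens = serialize(img, num_bands=4, patch=4)
    print(tokens.shape)
```

Because the bands in this sketch partition the DCT coefficients and the transform is linear, the band images sum back to the original image. Each band image spans the full spatial extent, so every token in the sequence is cut from an image that carries global content at one frequency range, which is the property the abstract emphasizes.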

Code Implementations: 1 repo