CV | Aug 25, 2024

MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation

Chaowei Chen, Li Yu, Shiquan Min, Shunfang Wang

arXiv:2408.13735v1 | 10.524 citations | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses accurate medical image segmentation for healthcare applications and represents an incremental improvement over existing SSM-based approaches.

The paper proposes MSVM-UNet, which adds multi-scale convolutions to the VSS blocks and uses large kernel patch expanding (LKPE) layers for upsampling, so that the model captures multi-scale features and handles 2D data better. It reports better results than some state-of-the-art methods on the Synapse and ACDC datasets.

State Space Models (SSMs), especially Mamba, have shown great promise in medical image segmentation due to their ability to model long-range dependencies with linear computational complexity. However, accurate medical image segmentation requires the effective learning of both multi-scale detailed feature representations and global contextual dependencies. Although existing works have attempted to address this issue by integrating CNNs and SSMs to leverage their respective strengths, they have not designed specialized modules to effectively capture multi-scale feature representations, nor have they adequately addressed the directional sensitivity problem when applying Mamba to 2D image data. To overcome these limitations, we propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet. Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder and better handle 2D visual data. Additionally, the large kernel patch expanding (LKPE) layers achieve more efficient upsampling of feature maps by simultaneously integrating spatial and channel information. Extensive experiments on the Synapse and ACDC datasets demonstrate that our approach is more effective than some state-of-the-art methods in capturing and aggregating multi-scale feature representations and modeling long-range dependencies between pixels.
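The two ideas named in the abstract, multi-scale convolutions and a patch expanding layer that mixes spatial and channel information, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the kernel sizes, the summation used to aggregate scales, the depthwise-then-pointwise ordering, and the pixel-shuffle rearrangement are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import numpy as np


def depthwise_conv2d(x, kernel):
    """Same-padded depthwise cross-correlation.

    x: (C, H, W) feature map; kernel: (C, k, k) with odd k, one filter per channel.
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernel[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out


def multi_scale_conv(x, kernels):
    """Aggregate depthwise responses at several kernel sizes.

    Summation is an assumed aggregation rule, used here only for illustration.
    """
    return sum(depthwise_conv2d(x, k) for k in kernels)


def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2)  # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)


def patch_expand(x, dw_kernel, pw_weight, r=2):
    """Assumed upsampling sketch: large-kernel depthwise conv (spatial mixing),
    pointwise projection (channel mixing), then pixel shuffle."""
    y = depthwise_conv2d(x, dw_kernel)
    y = np.einsum("oc,chw->ohw", pw_weight, y)  # (C_out*r*r, H, W)
    return pixel_shuffle(y, r)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8, 8))
    # Hypothetical kernel sizes 1, 3, 5 for the multi-scale branch.
    kernels = [rng.standard_normal((4, k, k)) for k in (1, 3, 5)]
    fused = multi_scale_conv(x, kernels)
    up = patch_expand(fused, rng.standard_normal((4, 7, 7)),
                      rng.standard_normal((2 * 4, 4)), r=2)
    print(fused.shape, up.shape)
```

In this sketch the multi-scale branch keeps the spatial size and channel count, while the patch expanding step doubles the height and width and halves the channel count, which is the usual role of an upsampling stage in a UNet decoder.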
