MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis
This work addresses the analysis of gigapixel-scale whole-slide images for medical diagnostics by improving how structured dependencies are modeled. It is an incremental advance over existing multiple instance learning and state space model approaches.
The paper tackles whole-slide image analysis by proposing MoEMambaMIL, a structure-aware selective state space modeling framework that organizes patch tokens into region-aware sequences to capture spatial hierarchies, and it reports the best performance across 9 downstream tasks.
Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem.
We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.
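As a rough illustration of the two ideas named in the abstract, the sketch below shows one way a region-nested token order and a static-plus-dynamic expert layer might look. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the two-level grid, the `scale` factor, the raster ordering, the function names `region_nested_order` and `moe_layer`, the linear experts, and the top-k softmax gating are all hypothetical choices made for illustration, and the selective state space (Mamba) scan itself is not shown.

```python
import numpy as np


def region_nested_order(coarse_xy, fine_xy, scale):
    """Return (level, index) pairs: each coarse token, then the fine tokens it contains.

    coarse_xy: (Nc, 2) integer (x, y) grid coordinates at the low resolution.
    fine_xy:   (Nf, 2) integer (x, y) grid coordinates at the high resolution.
    scale:     fine cells per coarse cell along each axis (assumed value).
    """
    coarse_xy = np.asarray(coarse_xy)
    fine_xy = np.asarray(fine_xy)
    parent = fine_xy // scale  # coarse cell that spatially contains each fine patch
    order = []
    # Visit coarse regions in raster order: y is the primary key, x the secondary.
    for ci in np.lexsort((coarse_xy[:, 0], coarse_xy[:, 1])):
        order.append((0, int(ci)))
        children = np.flatnonzero((parent == coarse_xy[ci]).all(axis=1))
        # Raster order inside the region as well.
        children = children[np.lexsort((fine_xy[children, 0], fine_xy[children, 1]))]
        order.extend((1, int(fi)) for fi in children)
    return order


def moe_layer(tokens, levels, static_w, dynamic_w, router_w, k=2):
    """Toy expert layer: static per-resolution experts, then top-k routed experts.

    tokens: (N, D), levels: (N,) ints, static_w: (L, D, D),
    dynamic_w: (E, D, D), router_w: (D, E). All experts are plain linear maps here.
    """
    # Static path: every token uses the expert tied to its resolution level.
    h = np.einsum("nd,nde->ne", tokens, static_w[levels])
    # Dynamic path: pick the k highest-scoring experts per token.
    logits = h @ router_w
    topk = np.argsort(-logits, axis=1)[:, :k]
    sel = np.take_along_axis(logits, topk, axis=1)
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)  # softmax over the selected experts only
    out = np.zeros_like(h)
    for j in range(k):
        expert_out = np.einsum("nd,nde->ne", h, dynamic_w[topk[:, j]])
        out += gates[:, [j]] * expert_out
    return out, topk
```

In this sketch the ordered `(level, index)` pairs would be used to gather patch features into one sequence before any sequence model is applied, and the `level` values double as the selector for the static experts. Whether the paper orders regions in raster order or gates experts this way is not stated in the abstract.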