CV AIMar 13, 2025

RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

Fengxiang Wang, Yulin Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Hongzhen Wang, Di Wang, Long Lan, Wenjing Yang, Jing Zhang

arXiv:2503.10392v218.211 citationsh-index: 8Has Code

Originality Incremental advance

AI Analysis

This addresses the problem of inefficient large-scale remote sensing model training for researchers and practitioners, offering a more scalable alternative with demonstrated gains.

The paper tackles the scalability limitations of Vision Transformers in remote sensing by proposing RoMA, a framework for self-supervised pretraining of Mamba-based models, which outperforms ViT-based counterparts in accuracy and efficiency across tasks like scene classification and object detection.

Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.

View on arXiv PDF Code

Similar