CV | Jul 22, 2025

M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision

arXiv:2507.16318v2 | 8 citations | h-index: 7 | Has Code
Originality: Highly original
AI Analysis

This work addresses the need for robust perception in complex environments by unifying case-by-case RGBT studies into a single paradigm, though it is incremental in building on prior multispectral fusion research.

The paper tackles the problem of RGB-Thermal multispectral vision by proposing M-SpecGene, a generalized foundation model that learns modality-invariant representations through self-supervised pre-training, achieving state-of-the-art performance across eleven datasets for four downstream tasks.

RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at https://github.com/CalayZhou/M-SpecGene.

