CV | Jul 27, 2025

L-MCAT: Unpaired Multimodal Transformer with Contrastive Attention for Label-Efficient Satellite Image Classification

arXiv:2507.20259v1 | h-index: 3

Originality: Highly original

AI Analysis

This work addresses the problem of limited labeled data in remote sensing applications and offers a robust, efficient approach to satellite image analysis, though its improvement over existing multimodal transformer methods is incremental.

The paper tackles label-efficient satellite image classification by proposing L-MCAT, a transformer-based framework that uses unpaired multimodal data. It achieves 95.4% overall accuracy on the SEN12MS dataset with only 20 labels per class while using 47x fewer parameters and 23x fewer FLOPs than MCTrans.

We propose the Lightweight Multimodal Contrastive Attention Transformer (L-MCAT), a novel transformer-based framework for label-efficient remote sensing image classification using unpaired multimodal satellite data. L-MCAT introduces two core innovations: (1) Modality-Spectral Adapters (MSA) that compress high-dimensional sensor inputs into a unified embedding space, and (2) Unpaired Multimodal Attention Alignment (U-MAA), a contrastive self-supervised mechanism integrated into the attention layers to align heterogeneous modalities without pixel-level correspondence or labels. L-MCAT achieves 95.4% overall accuracy on the SEN12MS dataset using only 20 labels per class, outperforming state-of-the-art baselines while using 47x fewer parameters and 23x fewer FLOPs than MCTrans. It maintains over 92% accuracy even under 50% spatial misalignment, demonstrating robustness for real-world deployment. The model trains end-to-end in under 5 hours on a single consumer GPU.
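The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes each adapter is a single linear projection into a shared embedding width, and it uses a generic InfoNCE loss as a stand-in for the contrastive alignment. The band counts, the embedding width, the function names, and above all the way positives are chosen are hypothetical, since the abstract does not say how positives are selected without pixel-level correspondence.

```python
import numpy as np

def make_adapter(in_bands, embed_dim, rng):
    """Hypothetical adapter: one linear map from a sensor's band count to a shared width."""
    weight = rng.normal(0.0, 1.0 / np.sqrt(in_bands), size=(in_bands, embed_dim))
    bias = np.zeros(embed_dim)
    return weight, bias

def apply_adapter(tokens, adapter):
    """Project (num_tokens, in_bands) tokens to (num_tokens, embed_dim)."""
    weight, bias = adapter
    return tokens @ weight + bias

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(anchors, candidates, positive_idx, temperature=0.1):
    """Generic InfoNCE: each anchor should score its designated positive highest."""
    logits = l2_normalize(anchors) @ l2_normalize(candidates).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(anchors)), positive_idx].mean()

rng = np.random.default_rng(0)
sar_tokens = rng.normal(size=(6, 2))       # assumed: 2-band radar tokens
optical_tokens = rng.normal(size=(6, 13))  # assumed: 13-band optical tokens

# Different band counts land in the same 32-dimensional space (width is assumed).
sar_emb = apply_adapter(sar_tokens, make_adapter(2, 32, rng))
opt_emb = apply_adapter(optical_tokens, make_adapter(13, 32, rng))

# Placeholder positives: a real unpaired scheme would need its own matching rule.
positive_idx = rng.integers(0, 6, size=6)
loss = info_nce(sar_emb, opt_emb, positive_idx)
```

The point of the sketch is only that, once both sensors are projected to a common width, a single similarity matrix between the two token sets can drive a contrastive objective. Where that objective sits inside the attention layers is specific to the paper and is not reproduced here.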

