CV | Nov 1, 2023

CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders

arXiv:2311.00566v1 | 186 citations | h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the challenge of leveraging multimodal remote sensing data for applications like classification and segmentation, representing an incremental advance in self-supervised learning for this domain.

The paper tackles representation learning from sparsely labeled, spatially aligned multimodal remote sensing data. It proposes CROMA, a framework that combines contrastive and reconstruction self-supervised objectives with two spatial attention biases, X-ALiBi and 2D-ALiBi. The results show that CROMA outperforms the current state-of-the-art multispectral model across four classification and three segmentation benchmarks, with average improvements of 1.4% to 8.4% depending on the evaluation protocol.
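As a rough illustration of the cross-modal contrastive objective mentioned above, the sketch below computes a symmetric InfoNCE loss between paired optical and radar embeddings. It is a generic formulation, not the paper's implementation: the function name `cross_modal_info_nce`, the temperature value, and the use of one pooled embedding per sample are assumptions made for the example.

```python
import numpy as np

def cross_modal_info_nce(optical, radar, temperature=0.07):
    """Symmetric InfoNCE loss for paired (B, D) optical and radar embeddings.

    Row i of `optical` and row i of `radar` are assumed to come from the same
    location and time (the positive pair); all other rows act as negatives.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    # L2-normalize so that dot products are cosine similarities.
    o = optical / np.linalg.norm(optical, axis=1, keepdims=True)
    r = radar / np.linalg.norm(radar, axis=1, keepdims=True)
    logits = o @ r.T / temperature  # (B, B) similarity matrix

    def diagonal_cross_entropy(l):
        # Numerically stable log-softmax over each row; positives sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the optical-to-radar and radar-to-optical directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

With correctly paired inputs the loss should be lower than with mismatched pairs, which is the signal that pushes the two unimodal encoders toward a shared embedding space.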

A vital and rapidly growing application, remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples -- aligned in space and time -- and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially bias our cross- and self-attention matrices. These strategies improve representations and allow our models to effectively extrapolate to images up to 17.6x larger at test time. CROMA outperforms the current SoTA multispectral model, evaluated on: four classification benchmarks -- finetuning (avg. 1.8%), linear (avg. 2.4%) and nonlinear (avg. 1.4%) probing, kNN classification (avg. 3.5%), and K-means clustering (avg. 8.4%); and three segmentation benchmarks (avg. 6.4%). CROMA's rich, optionally multimodal representations can be widely leveraged across remote sensing applications.
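To make the idea of a spatial attention bias concrete, here is a minimal sketch of a 2D ALiBi-style bias for self-attention over a square grid of patches. It assumes the bias is the negative Euclidean distance between patch positions scaled by a per-head slope, and it borrows the geometric slope schedule of the original 1D ALiBi; the function name `alibi_2d_bias` and these details are assumptions for illustration, not taken from the CROMA code.

```python
import numpy as np

def alibi_2d_bias(grid_size: int, num_heads: int) -> np.ndarray:
    """Return an additive attention bias of shape (num_heads, N, N), N = grid_size**2.

    The bias is added to the pre-softmax attention logits, so distant
    patches are penalized more strongly than nearby ones.
    """
    # (row, col) coordinates of each patch in a grid_size x grid_size grid.
    rows, cols = np.divmod(np.arange(grid_size * grid_size), grid_size)
    coords = np.stack([rows, cols], axis=-1).astype(np.float64)  # (N, 2)

    # Pairwise Euclidean distance between patch positions.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # (N, N)

    # Assumed slope schedule: 2^(-8/n), 2^(-16/n), ..., one slope per head.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

    return -slopes[:, None, None] * dist[None, :, :]
```

Because such a bias depends only on distances between patches and not on learned position embeddings, the same function can be called with a larger `grid_size` at test time, which is consistent with the extrapolation to larger images described in the abstract.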

Code Implementations: 1 repo
