CV · Jan 5, 2024

Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing

arXiv:2401.02764v2 · 12 citations · h-index: 44 · IGARSS
AI Analysis

This paper addresses the high labeling costs and the domain gap that complicate multimodal remote sensing data fusion. The contribution is incremental, since it builds on existing masked autoencoder approaches.

The paper tackled the problem of self-supervised representation learning for remote sensing by introducing Fus-MAE, a masked autoencoder framework using cross-attention for early fusion of SAR and optical data, which effectively competes with contrastive learning methods and outperforms other masked autoencoders trained on larger datasets.

Self-supervised frameworks for representation learning have recently stirred up interest among the remote sensing community, given their potential to mitigate the high labeling costs associated with curating large satellite image datasets. In the realm of multimodal data fusion, while the often-used contrastive learning methods can help bridge the domain gap between different sensor types, they rely on data augmentation techniques that require expertise and careful design, especially for multispectral remote sensing data. A possible but rather scarcely studied way to circumvent these limitations is to use a pretraining strategy based on masked image modelling. In this paper, we introduce Fus-MAE, a self-supervised learning framework based on masked autoencoders that uses cross-attention to perform early and feature-level data fusion between synthetic aperture radar (SAR) and multispectral optical data, two modalities with a significant domain gap. Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion and outperforms other masked autoencoder frameworks trained on a larger corpus.
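
To make the idea of cross-attention fusion between two masked token streams more concrete, the following is a minimal NumPy sketch. It is not the Fus-MAE architecture: the abstract does not give layer counts, dimensions, or masking settings, so the token count (64), embedding size (16), 75% mask ratio, single attention head, random projection weights, and the helper names `random_mask` and `cross_attention` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens (MAE-style masking); hypothetical helper."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep_idx], keep_idx


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend to `context`."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
dim = 16  # assumed embedding size

# Stand-ins for patch embeddings of co-registered SAR and optical images.
sar_tokens = rng.normal(size=(64, dim))
opt_tokens = rng.normal(size=(64, dim))

# Mask each modality independently (assumed 75% mask ratio).
sar_visible, sar_idx = random_mask(sar_tokens, 0.75, rng)
opt_visible, opt_idx = random_mask(opt_tokens, 0.75, rng)

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

# Visible SAR tokens query the visible optical tokens; a residual
# connection adds the attended optical information to the SAR stream.
attended, attn = cross_attention(sar_visible, opt_visible, w_q, w_k, w_v)
fused_sar = sar_visible + attended

print(fused_sar.shape, attn.shape)
```

In a trained model the projections would be learned, and a symmetric block would typically let the optical tokens attend to the SAR tokens as well. Whether Fus-MAE does so, and at which encoder depth, is not stated in the abstract.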

