CV | May 4, 2022

Self-Supervised Learning for Invariant Representations from Multi-Spectral and SAR Images

arXiv:2205.02049v2 | 50 citations | h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses the challenge of adapting self-supervised learning to remote sensing data. The advance is incremental: it applies an existing method (BYOL) to a new domain whose data differs non-trivially from natural images.

The paper tackles the problem of learning invariant feature representations from remote sensing data, specifically multi-spectral and SAR images, using self-supervised learning. For certain single bands it reports a 0.92 F1 score on EuroSAT classification and 59.6 mIoU on DFC segmentation, and it reports that the remote-sensing-based model outperforms a supervised ImageNet-based model.

Self-supervised learning (SSL) has become the new state of the art in classification and segmentation tasks across several domains. One popular category of SSL methods is distillation networks such as BYOL. This work proposes RSDnet, which applies the distillation network (BYOL) in the remote sensing (RS) domain, where data is non-trivially different from natural RGB images. Since multi-spectral (MS) and synthetic aperture radar (SAR) sensors provide varied spectral and spatial resolution information, we utilised them as an implicit augmentation to learn invariant feature embeddings. To learn RS-based invariant features with SSL, we trained RSDnet in two ways: single-channel feature learning and three-channel feature learning. This work explores the usefulness of single-channel feature learning from random MS and SAR bands compared to the common notion of using three or more bands. In our linear evaluation, these single-channel features reached a 0.92 F1 score on the EuroSAT classification task and 59.6 mIoU on the DFC segmentation task for certain single bands. We also compared our results with ImageNet weights and showed that the RS-based SSL model outperforms the supervised ImageNet-based model. We further explored the usefulness of multi-modal data compared to single-modality data, and showed that utilising MS and SAR data learns better invariant representations than utilising only MS data.
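
The abstract gives no implementation details, so the following is only a minimal sketch of the general BYOL-style recipe it refers to, written in plain Python. The function names (`pick_single_band_views`, `byol_loss`, `ema_update`), the band dictionaries, and the value of `tau` are hypothetical and are not taken from the paper. Pairing one random MS band with one random SAR band of the same scene is likewise an assumption about how "implicit augmentation" might be realised. The loss and the moving-average update follow the standard BYOL formulation.

```python
import math
import random


def pick_single_band_views(ms_bands, sar_bands, rng):
    """Pick two single-channel views of one scene.

    Assumption: one random MS band and one random SAR band stand in for the
    two augmented views that BYOL normally builds from a single RGB image.
    """
    ms_name = rng.choice(sorted(ms_bands))
    sar_name = rng.choice(sorted(sar_bands))
    return (ms_name, ms_bands[ms_name]), (sar_name, sar_bands[sar_name])


def byol_loss(online_prediction, target_projection):
    """Standard BYOL regression loss: 2 - 2 * cosine similarity.

    It is 0 when the two vectors point the same way and 4 when opposite.
    """
    dot = sum(p * z for p, z in zip(online_prediction, target_projection))
    norm_p = math.sqrt(sum(p * p for p in online_prediction))
    norm_z = math.sqrt(sum(z * z for z in target_projection))
    return 2.0 - 2.0 * dot / (norm_p * norm_z)


def ema_update(target_weights, online_weights, tau=0.99):
    """Exponential moving average used for the target network in BYOL.

    tau=0.99 is an illustrative value, not one reported by the paper.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "scene": each band is a flat list of pixel values (hypothetical data).
    ms = {"B02": [0.1, 0.2], "B03": [0.3, 0.1], "B08": [0.5, 0.4]}
    sar = {"VV": [0.7, 0.2], "VH": [0.2, 0.6]}
    view_a, view_b = pick_single_band_views(ms, sar, rng)
    print("views:", view_a[0], view_b[0])
    print("loss:", byol_loss(view_a[1], view_b[1]))
    print("target:", ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9))
```

In a real training loop the two views would pass through an online encoder with a predictor head and a target encoder; only the online branch receives gradients, and `ema_update` is applied to the target weights after each step.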

