Joint multi-modal Self-Supervised pre-training in Remote Sensing: Application to Methane Source Classification
This work addresses the annotation bottleneck in remote sensing applications such as methane monitoring. Its contribution is incremental: it extends existing self-supervised methods with domain-specific multi-modal data.
Deep learning in remote sensing typically requires large labeled datasets. To reduce this requirement, the paper proposes a self-supervised pre-training method that uses multiple sensor modalities and geographical data to learn image encoders without annotations. It applies the method to methane source classification and reports competitive performance on this specific task.
With the current ubiquity of deep learning methods for computer vision and remote-sensing-specific tasks, the need for labelled data is growing constantly. However, in many cases, the annotation process can be long and tedious, depending on the expertise needed to produce reliable annotations. To alleviate this need for annotations, several self-supervised methods have recently been proposed in the literature. The core principle behind these methods is to learn an image encoder using solely unlabelled data samples. In Earth observation, there are opportunities to exploit domain-specific remote sensing image data to improve these methods. Specifically, by leveraging the geographical position associated with each image, it is possible to cross-reference a location captured by multiple sensors, leading to multiple views of the same location. In this paper, we briefly review the core principles behind so-called joint-embedding methods and investigate the use of multiple remote sensing modalities in self-supervised pre-training. We evaluate the final performance of the resulting encoders on the task of methane source classification.
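The joint-embedding idea described above, where two sensors observing the same location provide two views whose embeddings should agree, can be sketched as follows. This is a minimal illustration only, assuming an InfoNCE-style contrastive objective, which is one common joint-embedding loss and not necessarily the one used in this paper. The function names (`l2_normalize`, `info_nce`), the temperature value, and the random arrays standing in for encoder outputs are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Assumption: row i of z_a and row i of z_b are embeddings of the same
    geographical location seen by two different sensors (a positive pair);
    every other row in the batch serves as a negative.
    """
    z_a = l2_normalize(z_a)
    z_b = l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row; the correct
        # "class" for row i is column i (the matching location).
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


# Hypothetical stand-ins for the outputs of two modality-specific encoders.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))                        # location-specific content
z_sensor_a = shared + 0.01 * rng.normal(size=(8, 16))    # view from sensor A
z_sensor_b = shared + 0.01 * rng.normal(size=(8, 16))    # view from sensor B

loss_aligned = info_nce(z_sensor_a, z_sensor_b)          # rows paired by location
loss_shuffled = info_nce(z_sensor_a, z_sensor_b[::-1])   # pairing deliberately broken
```

In this toy setup the loss is lower when rows are paired by location than when the pairing is broken, which is the signal a joint-embedding method uses to train encoders without labels. In practice the embeddings would come from trainable image encoders and the loss would be minimised by gradient descent in a deep learning framework.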