SatSwinMAE: Efficient Autoencoding for Multiscale Time-series Satellite Imagery
This work addresses the need for Earth observation models that efficiently handle multiscale spatio-temporal dependencies in satellite imagery. It is an incremental advance that adapts existing methods to a new domain.
The authors tackle the problem of processing large-scale, unlabeled satellite time-series data by extending the SwinMAE model to integrate temporal information. The resulting model improves on other geospatial foundation models across the evaluated downstream tasks, including 10.4% higher accuracy in land cover segmentation.
Recent advancements in foundation models have significantly impacted various fields, including natural language processing, computer vision, and multi-modal tasks. One area that stands to benefit greatly is Earth observation, where these models can efficiently process large-scale, unlabeled geospatial data. In this work, we extend the SwinMAE model to integrate temporal information for satellite time-series data. The architecture employs a hierarchical 3D Masked Autoencoder (MAE) with Video Swin Transformer blocks to capture multiscale spatio-temporal dependencies in satellite imagery. To enhance transfer learning, we incorporate the pretrained weights of both the encoder and the decoder, along with skip connections that preserve scale-specific information. This forms an architecture similar to SwinUNet with an additional temporal component. Our approach shows significant performance improvements over existing state-of-the-art foundation models on all evaluated downstream tasks: land cover segmentation, building density prediction, flood mapping, wildfire scar mapping, and multi-temporal crop segmentation. In particular, on the land cover segmentation task of the PhilEO Bench dataset, it outperforms other geospatial foundation models with 10.4% higher accuracy.
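To make the idea of a hierarchical 3D masked autoencoder more concrete, the following is a minimal sketch, not the authors' implementation. It shows how a satellite time series could be divided into non-overlapping 3D patches, how a random subset of those patches could be hidden during pretraining, and how the feature-map shapes might evolve across hierarchical stages (the scales at which skip connections would link encoder and decoder). All function names, the patch size, the mask ratio of 0.75, and the embedding dimension of 96 are illustrative assumptions that do not come from the abstract. The stage layout assumes spatial-only 2x downsampling with channel doubling, as in the Video Swin Transformer; the actual model may differ.

```python
import random


def patch_grid(t, h, w, pt, ph, pw):
    """Number of non-overlapping 3D patches along (time, height, width).

    Hypothetical helper: the patch size (pt, ph, pw) is an assumption.
    """
    if t % pt or h % ph or w % pw:
        raise ValueError("input dimensions must be divisible by the patch size")
    return t // pt, h // ph, w // pw


def random_patch_mask(grid, mask_ratio, seed=0):
    """Choose which 3D patches are hidden during MAE pretraining.

    Returns a nested list mask[ti][hi][wi] where True means "masked".
    The mask ratio is an assumed hyperparameter, not taken from the paper.
    """
    gt, gh, gw = grid
    total = gt * gh * gw
    n_masked = int(round(total * mask_ratio))
    rng = random.Random(seed)
    masked = set(rng.sample(range(total), n_masked))
    return [
        [[(ti * gh + hi) * gw + wi in masked for wi in range(gw)]
         for hi in range(gh)]
        for ti in range(gt)
    ]


def stage_shapes(grid, embed_dim, num_stages):
    """Feature-map shape (time, height, width, channels) at each stage.

    Assumes each stage transition halves height and width and doubles the
    channel count while leaving the temporal length unchanged.
    """
    gt, gh, gw = grid
    shapes = []
    for _ in range(num_stages):
        shapes.append((gt, gh, gw, embed_dim))
        gh, gw, embed_dim = gh // 2, gw // 2, embed_dim * 2
    return shapes


if __name__ == "__main__":
    # Example: 4 time steps of 64x64 pixels, with assumed 2x4x4 patches.
    grid = patch_grid(4, 64, 64, 2, 4, 4)
    mask = random_patch_mask(grid, mask_ratio=0.75)
    n_masked = sum(cell for frame in mask for row in frame for cell in row)
    print("patch grid:", grid)
    print("masked patches:", n_masked, "of", grid[0] * grid[1] * grid[2])
    for i, shape in enumerate(stage_shapes(grid, embed_dim=96, num_stages=3)):
        print("stage", i, "feature map:", shape)
```

In a U-shaped design of this kind, each encoder stage shape listed by stage_shapes would be matched by a decoder stage at the same resolution, which is where skip connections can carry scale-specific information across.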