CV · LG · Jan 12, 2023

ViTs for SITS: Vision Transformers for Satellite Image Time Series

arXiv:2301.04944v3 · 102 citations · h-index: 82
Originality: Highly original
AI Analysis

This work addresses the challenge of analyzing satellite image time series for applications like semantic segmentation and classification, representing an incremental improvement over existing methods.

The paper tackles the problem of processing Satellite Image Time Series (SITS) by introducing TSViT, a vision transformer model that uses a temporal-then-spatial factorization and novel mechanisms like acquisition-time-specific positional encodings and multiple class tokens, achieving state-of-the-art performance with significant margins on three public datasets.

In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time, which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and we present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms: acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin on three publicly available SITS semantic segmentation and classification datasets. All model, training, and evaluation code is made publicly available to facilitate further research.
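The data flow described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation; it only traces tensor shapes through a temporal-then-spatial layout. Several details are assumptions made for illustration and are not stated in the abstract: a temporal patch size of 1, a positional lookup table indexed by an integer acquisition day, class tokens prepended to each temporal sequence, and an identity stand-in where the temporal encoder would run. The learned linear projection of patches is omitted, so the token width equals `patch * patch * C`. The names `tokenize_sits` and `temporal_then_spatial_layout` are hypothetical.

```python
import numpy as np


def tokenize_sits(x, patch):
    """Split a (T, H, W, C) series into non-overlapping spatial patches.

    Returns an array of shape (T, N, patch*patch*C), N = (H//patch)*(W//patch).
    Assumption: temporal patch size is 1 (one token per acquisition and location).
    """
    T, H, W, C = x.shape
    assert H % patch == 0 and W % patch == 0, "H and W must be divisible by patch"
    nh, nw = H // patch, W // patch
    x = x.reshape(T, nh, patch, nw, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (T, nh, nw, patch, patch, C)
    return x.reshape(T, nh * nw, patch * patch * C)


def temporal_then_spatial_layout(tokens, days, pos_table, cls_tokens):
    """Arrange tokens for a temporal encoder first, then a spatial encoder.

    tokens:     (T, N, d) patch tokens
    days:       (T,) integer acquisition-day indices (hypothetical encoding)
    pos_table:  (D, d) lookup table, one row per possible acquisition day
    cls_tokens: (K, d) class tokens, assumed here to be one per class
    Returns (temporal_in, spatial_in) with shapes (N, K+T, d) and (K, N, d).
    """
    T, N, d = tokens.shape
    K = cls_tokens.shape[0]
    # Positional encoding selected by acquisition time, not by sequence index.
    tokens = tokens + pos_table[days][:, None, :]
    # Temporal stage: each spatial location is an independent sequence over time.
    per_location = tokens.transpose(1, 0, 2)               # (N, T, d)
    cls = np.broadcast_to(cls_tokens, (N, K, d))
    temporal_in = np.concatenate([cls, per_location], axis=1)  # (N, K+T, d)
    temporal_out = temporal_in  # identity stand-in for a temporal encoder
    # Spatial stage: for each class token, attend across the N locations.
    spatial_in = temporal_out[:, :K, :].transpose(1, 0, 2)  # (K, N, d)
    return temporal_in, spatial_in


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8, 8, 3))          # T=4 acquisitions, 8x8 pixels, 3 bands
    tokens = tokenize_sits(x, patch=4)          # (4, 4, 48)
    days = np.array([10, 45, 120, 300])         # irregular acquisition days
    pos_table = rng.normal(size=(366, tokens.shape[-1]))
    cls_tokens = rng.normal(size=(3, tokens.shape[-1]))
    t_in, s_in = temporal_then_spatial_layout(tokens, days, pos_table, cls_tokens)
    print(tokens.shape, t_in.shape, s_in.shape)
```

Indexing the positional table by acquisition day rather than by sequence position is one way to handle irregularly sampled series: two records with different revisit dates then receive different encodings even when they contain the same number of images.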

Code Implementations: 3 repos
