CV
Jun 24, 2024

Multi-Modal Vision Transformers for Crop Mapping from Satellite Image Time Series

arXiv:2406.16513v1
4 citations
Originality: Incremental advance
AI Analysis

This work addresses crop mapping for agricultural monitoring. It is an incremental advance that adapts existing transformer methods to multi-modal data.

The paper tackles crop mapping from satellite image time series by introducing multi-modal transformer architectures, and it reports significant improvements over state-of-the-art methods.

Using images acquired by different satellite sensors has been shown to improve classification performance in crop mapping from satellite image time series (SITS). Existing state-of-the-art architectures use self-attention mechanisms to process the temporal dimension and convolutions to process the spatial dimension of SITS. Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures. Specifically, we investigate the effectiveness of Early Fusion, Cross Attention Fusion, and Synchronized Class Token Fusion within the Temporo-Spatial Vision Transformer (TSViT). Experimental results demonstrate significant improvements over state-of-the-art architectures with both convolutional and self-attention components.
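
To make two of the fusion strategies named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that Early Fusion means concatenating the per-timestep features of two temporally aligned modalities along the channel axis before tokenization, and that Cross Attention Fusion lets tokens of one modality attend to tokens of the other via scaled dot-product attention. The function names `early_fusion` and `cross_attention` are hypothetical, and learned projections, multiple heads, and positional encodings are omitted for brevity.

```python
import math


def early_fusion(series_a, series_b):
    """Concatenate per-timestep feature vectors of two modalities.

    Assumption: both series are already temporally aligned, so timestep i
    of series_a corresponds to timestep i of series_b.
    """
    if len(series_a) != len(series_b):
        raise ValueError("this sketch assumes temporally aligned series")
    return [a + b for a, b in zip(series_a, series_b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross attention.

    Tokens of one modality (queries) attend to tokens of another
    (keys_values). Assumption: no learned Q/K/V projections, so keys and
    values are the same vectors and all tokens share one dimension.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return fused
```

In this sketch, early fusion produces one joint token sequence for a single encoder, whereas cross attention keeps the modality streams separate and exchanges information between them; each output token of `cross_attention` is a convex combination of the other modality's tokens.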

