CVDec 30, 2024

UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models

arXiv:2412.20742v18 citationsh-index: 14Has Code
Originality Incremental advance
AI Analysis

This work addresses the domain gap in remote sensing for researchers and practitioners by providing a unified framework for multi-temporal analysis, though it is incremental as it builds on existing vision-language models.

The authors tackled the problem of handling different types of visual inputs in remote sensing tasks by introducing UniRS, a vision-language model that unifies multi-temporal analysis across single images, dual-time pairs, and videos, achieving state-of-the-art performance in tasks like visual question answering, change captioning, and video scene classification.

The domain gap between remote sensing imagery and natural images has recently received widespread attention and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce \textbf{UniRS}, the first vision-language model \textbf{uni}fying multi-temporal \textbf{r}emote \textbf{s}ensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes