CVApr 15

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

arXiv:2604.1404494.2h-index: 6
AI Analysis

This work provides a unified framework for remote sensing change detection and understanding, benefiting earth observation intelligence by enabling more accurate and comprehensive analysis of multi-temporal imagery.

The paper addresses the temporal blindness of Multimodal Large Language Models (MLLMs) in remote sensing change understanding by introducing Delta-QA, a benchmark with 180k samples, and Delta-LLaVA, a novel MLLM framework. Delta-LLaVA outperforms leading generalist MLLMs and specialized segmentation models in change deduction and boundary localization.

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes