CVMay 27, 2025

DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding

Weihao Xuan, Junjue Wang, Heli Qi, Zihang Chen, Zhuo Zheng, Yanfei Zhong, Junshi Xia, Naoto Yokoya

arXiv:2505.21076v221.121 citationsh-index: 14

Originality Incremental advance

AI Analysis

This addresses the problem of analyzing urban dynamics for remote sensing and AI researchers, but it is incremental as it builds on existing MLLM capabilities with a new dataset and baseline model.

The paper tackles the limited application of multimodal large language models (MLLMs) to long-term Earth observation by introducing DVL-Suite, a framework with 14,871 multi-temporal images across 42 U.S. cities, and reveals that 18 state-of-the-art MLLMs have limitations in long-term temporal understanding and quantitative analysis.

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in visual understanding, but their application to long-term Earth observation analysis remains limited, primarily focusing on single-temporal or bi-temporal imagery. To address this gap, we introduce DVL-Suite, a comprehensive framework for analyzing long-term urban dynamics through remote sensing imagery. Our suite comprises 14,871 high-resolution (1.0m) multi-temporal images spanning 42 major cities in the U.S. from 2005 to 2023, organized into two components: DVL-Bench and DVL-Instruct. The DVL-Bench includes six urban understanding tasks, from fundamental change detection (pixel-level) to quantitative analyses (regional-level) and comprehensive urban narratives (scene-level), capturing diverse urban dynamics including expansion/transformation patterns, disaster assessment, and environmental challenges. We evaluate 18 state-of-the-art MLLMs and reveal their limitations in long-term temporal understanding and quantitative analysis. These challenges motivate the creation of DVL-Instruct, a specialized instruction-tuning dataset designed to enhance models' capabilities in multi-temporal Earth observation. Building upon this dataset, we develop DVLChat, a baseline model capable of both image-level question-answering and pixel-level segmentation, facilitating a comprehensive understanding of city dynamics through language interactions.

View on arXiv PDF

Similar