IV · AI · CV · May 11

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

arXiv:2605.10739 · 16.5
Predicted impact: top 9% in IV · last 90 days · Originality: Incremental advance
AI Analysis

This work provides a reproducible dataset and framework for language-guided reasoning about remote sensing activities, enabling not just change detection but understanding of ongoing processes.

The authors introduce SMART-HC-VQA, a VQA dataset for spatiotemporal analysis of human activity from Sentinel-2 imagery, comprising 21,837 image chips, 65,511 single-image VQA examples, and ~2.3 million temporal comparison examples. They also implement a multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B.

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering (VQA) dataset derived from the IARPA SMART Heavy Construction (SMART-HC) dataset and designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach recasts the existing dataset as a temporally extended automatic target recognition and VQA challenge, treating a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for language-guided understanding of remote sensing activity, aiming not only to detect change but also to reason about ongoing processes, their progression, and potential future developments.
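The abstract does not spell out how Image-Pairwise Combinatorial Augmentation works, so the following is only a minimal sketch of one plausible reading: every unordered pair of dated observations of the same site becomes one temporal comparison example. The `Observation` schema, the question wording, the yes/no answer format, and the phase label strings are all illustrative assumptions, not the authors' implementation. As a side note on the reported counts, 21,837 × 3 = 65,511, which is consistent with three single-image questions per chip.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Observation:
    """One dated image chip of a site (hypothetical schema, not the dataset's)."""
    site_id: str
    acquired: date
    phase: str  # illustrative phase label, e.g. "Active Construction"


def pairwise_examples(observations):
    """Yield one comparison example per unordered pair of observations
    of the same site, always ordered earlier image first."""
    by_site = {}
    for obs in observations:
        by_site.setdefault(obs.site_id, []).append(obs)

    for site_id, site_obs in by_site.items():
        site_obs.sort(key=lambda o: o.acquired)
        # combinations() on a date-sorted list keeps the earlier image first.
        for earlier, later in combinations(site_obs, 2):
            yield {
                "site_id": site_id,
                "dates": (earlier.acquired.isoformat(), later.acquired.isoformat()),
                # Assumed question template; the paper's templates may differ.
                "question": "Did the construction phase change between the two images?",
                "answer": "yes" if earlier.phase != later.phase else "no",
            }


# Toy input: three observations of one site, two of another (made-up values).
observations = [
    Observation("site_001", date(2020, 6, 13), "Active Construction"),
    Observation("site_001", date(2020, 1, 5), "Site Preparation"),
    Observation("site_001", date(2021, 2, 20), "Active Construction"),
    Observation("site_002", date(2019, 3, 1), "No Activity"),
    Observation("site_002", date(2019, 9, 10), "No Activity"),
]
examples = list(pairwise_examples(observations))
```

Under this reading, a site with n observations contributes n(n-1)/2 pairs, so the example count grows quadratically with observation count. Growth of that kind would be consistent with roughly 2.3 million two-image examples arising from 21,837 chips, though the exact pairing rules and the number of questions per pair are not stated here.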
