Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

David F. Ramirez, Tim Overman, Kristen Jaskie, Andreas Spanias

arXiv:2605.10739


AI Analysis

This work provides a reproducible dataset and framework for language-guided reasoning about remote sensing activities, enabling not just change detection but understanding of ongoing processes.

The authors introduce SMART-HC-VQA, a VQA dataset for spatiotemporal analysis of human activity from Sentinel-2 imagery, comprising 21,837 image chips, 65,511 single-image VQA examples, and ~2.3 million temporal comparison examples. They also implement a multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B.

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach recasts the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, treating a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for language-guided understanding of remote sensing activity, aiming not only to detect change but also to reason about ongoing processes, their progression, and potential future developments.
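The abstract names Image-Pairwise Combinatorial Augmentation but does not spell out its mechanics here. As a minimal sketch of one plausible reading, the Python below pairs every two dated observations of the same site and derives a comparison question from their metadata. The `Observation` record, the phase label strings, the question wording, and the choice of unordered earlier-first pairs are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Observation:
    """One dated image chip of a site (hypothetical record layout)."""
    site_id: str
    acquired: date
    phase: str      # illustrative label, e.g. "Site Preparation"
    chip_path: str


def pairwise_temporal_examples(observations):
    """Yield one two-image example per unordered pair of observations
    that share a site, with the earlier image listed first."""
    by_site = {}
    for obs in observations:
        by_site.setdefault(obs.site_id, []).append(obs)

    for site_id, site_obs in by_site.items():
        # Sorting by date means combinations() emits (earlier, later) pairs.
        site_obs.sort(key=lambda o: o.acquired)
        for earlier, later in combinations(site_obs, 2):
            days = (later.acquired - earlier.acquired).days
            if earlier.phase != later.phase:
                answer = f"Yes, from {earlier.phase} to {later.phase} over {days} days."
            else:
                answer = f"No, the site remained in {earlier.phase}."
            yield {
                "site_id": site_id,
                "images": [earlier.chip_path, later.chip_path],
                "question": (
                    f"The first image is dated {earlier.acquired} and the second "
                    f"{later.acquired}. Did the construction phase change?"
                ),
                "answer": answer,
            }


# Toy data: three observations of one site, one observation of another.
obs = [
    Observation("site_a", date(2020, 1, 5), "Site Preparation", "a_0.png"),
    Observation("site_a", date(2020, 6, 1), "Active Construction", "a_1.png"),
    Observation("site_a", date(2021, 2, 10), "Active Construction", "a_2.png"),
    Observation("site_b", date(2020, 3, 3), "Site Preparation", "b_0.png"),
]
examples = list(pairwise_temporal_examples(obs))
print(len(examples))  # site_a contributes 3 pairs, site_b contributes none
```

Under this reading, a site with n observations contributes n(n-1)/2 pairs, which is how a pool of tens of thousands of chips can grow into millions of two-image examples. Whether the authors count ordered or unordered pairs, or filter pairs by time gap, is not stated in the abstract.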
