CV | Jun 2, 2025

EarthMind: Leveraging Cross-Sensor Data for Advanced Earth Observation Interpretation with a Unified Multimodal LLM

arXiv:2506.01667v2 | 10 citations | h-index: 30
Originality: Incremental advance
AI Analysis

EarthMind addresses the need for better monitoring of environmental and human dynamics by enabling cross-sensor fusion in Earth Observation. The contribution appears incremental because it builds on existing MLLM approaches.

The paper tackles a limitation of Multimodal Large Language Models for Earth Observation: they accept only single-sensor inputs. It proposes EarthMind, a unified framework that achieves state-of-the-art results on a new benchmark, EarthMind-Bench, and outperforms existing models on multiple EO benchmarks.

Earth Observation (EO) data analysis is vital for monitoring environmental and human dynamics. Recent Multimodal Large Language Models (MLLMs) show potential in EO understanding but remain restricted to single-sensor inputs, overlooking the complementarity across heterogeneous modalities. We propose EarthMind, a unified vision-language framework that handles both single- and cross-sensor inputs via an innovative hierarchical cross-modal attention (HCA) design. Specifically, HCA hierarchically captures visual relationships across sensors and aligns them with language queries, enabling adaptive fusion of optical and Synthetic Aperture Radar (SAR) features. To support cross-sensor learning, we curate FusionEO, a 30K-pair dataset with diverse annotations, and establish EarthMind-Bench, a 2,841-pair benchmark with expert annotations for perception and reasoning tasks. Extensive experiments show that EarthMind achieves state-of-the-art results on EarthMind-Bench and surpasses existing MLLMs on multiple EO benchmarks.
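
The abstract describes fusion of optical and SAR features that adapts to the language query but does not give the mechanism. As a rough illustration only, the toy sketch below weights two modalities by how well their token features match a query vector, then blends them token by token. It is not the HCA module from the paper: the function names (`modality_weights`, `fuse`), the mean scaled dot-product relevance score, and the softmax weighting are all assumptions chosen for simplicity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def modality_weights(optical, sar, query):
    """Hypothetical relevance step: score each modality by the mean
    scaled dot product between its tokens and the query, then
    normalise the two scores with a softmax."""
    scale = math.sqrt(len(query))

    def relevance(tokens):
        return sum(dot(t, query) for t in tokens) / (len(tokens) * scale)

    return softmax([relevance(optical), relevance(sar)])

def fuse(optical, sar, query):
    """Blend spatially aligned optical and SAR tokens with the
    query-dependent weights (a convex combination per token)."""
    w_opt, w_sar = modality_weights(optical, sar, query)
    fused = [
        [w_opt * o + w_sar * s for o, s in zip(o_tok, s_tok)]
        for o_tok, s_tok in zip(optical, sar)
    ]
    return fused, (w_opt, w_sar)

# Toy inputs: 3 tokens of dimension 4 per modality (made-up values).
optical = [[0.0] * 4 for _ in range(3)]
sar = [[1.0] * 4 for _ in range(3)]
query = [1.0] * 4
fused, weights = fuse(optical, sar, query)
```

In this toy case the query matches the SAR tokens more closely, so the SAR weight is the larger of the two. A real model would learn projections for queries, keys, and values rather than compare raw features.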

