CV · AI · Apr 10

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

Xinyu Zhang, Zurong Mai, Qingmei Li, Zjin Liao, Yibin Wen, Yuhang Chen, Xiaoya Fan, Chan Tsz Ho, Bi Tianyuan, Haoyuan Liang, Ruifeng Su, Zihao Qian

arXiv:2604.08884 · 97.5 · h-index: 5 · Has Code

Predicted impact: top 18% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This addresses a gap for researchers in remote sensing and AI by providing the first benchmark for evaluating MLLMs on hyperspectral data, though the advance is incremental because it builds on existing MLLM frameworks.

The authors tackle the lack of evaluation for multimodal large language models (MLLMs) in hyperspectral remote sensing by introducing HM-Bench, a benchmark with 19,337 question-answer pairs across 13 tasks. Their evaluation shows that MLLMs struggle with complex spatial-spectral reasoning and that visual inputs generally outperform textual ones.

While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral images (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs in HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of how different representations affect model performance. Extensive evaluations on 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. The dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.
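
The two representations named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not give implementation details, so the function names `pca_composite` and `text_report`, the choice of three components, the min-max scaling, and the report layout are all assumptions made for illustration. The sketch assumes a cube shaped (height, width, bands) with at least three bands.

```python
import numpy as np


def pca_composite(cube: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project an (H, W, B) cube onto its top principal components and
    rescale each component to 0..255 (hypothetical helper, not from the paper)."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(np.float64)  # astype returns a copy
    pixels -= pixels.mean(axis=0)  # centre each band

    # Eigen-decomposition of the B x B band covariance matrix.
    cov = np.cov(pixels, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    top = eigvecs[:, ::-1][:, :n_components]  # largest-variance directions first

    scores = pixels @ top
    lo = scores.min(axis=0)
    hi = scores.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    scaled = (scores - lo) / span
    return (scaled * 255).round().astype(np.uint8).reshape(h, w, n_components)


def text_report(cube: np.ndarray, wavelengths_nm) -> str:
    """Summarise per-band statistics as plain text (assumed report layout)."""
    h, w, b = cube.shape
    lines = [f"Cube: {h} x {w} pixels, {b} bands"]
    for i, wl in enumerate(wavelengths_nm):
        band = cube[:, :, i]
        lines.append(
            f"band {i} ({wl:.0f} nm): mean={band.mean():.3f}, std={band.std():.3f}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Synthetic stand-in for a real hyperspectral cube.
    demo = np.random.default_rng(0).random((4, 5, 6))
    image = pca_composite(demo)
    report = text_report(demo, [450, 550, 650, 750, 850, 950])
    print(image.shape)
    print(report.splitlines()[0])
```

The composite could then be passed to an MLLM as an ordinary three-channel image, while the report would go in as part of the text prompt. Which of the two a model handles better is the comparison the benchmark is built to make.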
