CV · AI · May 8

LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation

arXiv:2605.07640 · 81.8
Predicted impact: top 26% in CV · last 90 days
Originality: Synthesis-oriented
AI Analysis

For geologists and remote sensing researchers, this benchmark addresses the lack of standardized evaluation for knowledge-intensive lithology interpretation, but it is an incremental contribution as it primarily provides a new dataset and evaluation framework.

LithoBench introduces a multi-level benchmark with 10,000 expert-annotated instances across 12 lithological categories for evaluating large multimodal models on remote-sensing lithology interpretation. Experiments reveal substantial limitations in higher-order geological semantic understanding.

Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline that couples multiple sub-processes, e.g., structured geological image descriptions, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models reveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
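To make the multi-level structure concrete, the sketch below shows one way scores on the multiple-choice portion could be broken down by cognitive level. It is a minimal illustration and not the authors' evaluation code: the abstract does not state the scoring metric, so plain accuracy is an assumption, and the `MCItem` record and the `accuracy_by_level` function are hypothetical names introduced here. Only the five level names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five cognitive levels named in the abstract.
LEVELS = (
    "Identification and Description",
    "Comparative Analysis",
    "Mechanism Explanation",
    "Practical Application",
    "Comprehensive Reasoning",
)


@dataclass(frozen=True)
class MCItem:
    """Hypothetical record for one multiple-choice instance (not the paper's schema)."""
    level: str       # one of LEVELS
    answer: str      # gold option label, e.g. "B"
    prediction: str  # option label chosen by the model


def accuracy_by_level(items):
    """Return {level: accuracy} for every level that has at least one item.

    Accuracy is an assumed metric; the abstract does not specify one.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        if item.level not in LEVELS:
            raise ValueError(f"unknown cognitive level: {item.level!r}")
        total[item.level] += 1
        # Compare option labels case-insensitively, ignoring stray whitespace.
        if item.prediction.strip().upper() == item.answer.strip().upper():
            correct[item.level] += 1
    # Report levels in the order they are listed in LEVELS.
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if lvl in total}


if __name__ == "__main__":
    # Toy data for illustration only; these are not benchmark results.
    toy = [
        MCItem("Identification and Description", "A", "A"),
        MCItem("Comprehensive Reasoning", "C", "B"),
        MCItem("Comprehensive Reasoning", "D", "d"),
    ]
    for level, acc in accuracy_by_level(toy).items():
        print(f"{level}: {acc:.2f}")
```

Reporting one score per level rather than a single aggregate keeps the gap between lower-order and higher-order tasks visible, which is the contrast the abstract emphasizes. The 6,000 open-ended tasks would need a different scoring scheme, which this sketch does not attempt.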

