CV | AI | May 18, 2025

Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

arXiv:2505.12207v35 citations | h-index: 26 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses how to evaluate large multimodal models (LMMs) on agricultural remote sensing, a problem relevant to both researchers and practitioners. The contribution is incremental: it builds on existing benchmarks by expanding dataset diversity and task complexity.

The authors address the lack of comprehensive benchmarks for agricultural remote sensing by introducing AgroMind, a benchmark covering four task dimensions and 13 task types. They find significant performance gaps in LMMs, particularly in spatial reasoning and fine-grained recognition, and report that human performance lags behind several leading LMMs.

Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios have notable limitations, chiefly insufficient scene diversity in their datasets and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 27,247 QA pairs and 19,615 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we run LMMs for inference, generate responses, and examine the results in detail. We evaluate 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition. Notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at https://rssysu.github.io/AgroMind/.
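
As a rough illustration of the final evaluation step, the sketch below scores model responses against reference answers and reports accuracy per task dimension. It is not the AgroMind code: the record keys (`dimension`, `answer`, `prediction`), the `normalize` helper, and the exact-match scoring rule are all assumptions made for this example, and the benchmark's actual metrics may differ.

```python
from collections import defaultdict

# The four task dimensions named in the abstract.
DIMENSIONS = (
    "spatial perception",
    "object understanding",
    "scene understanding",
    "scene reasoning",
)


def normalize(answer: str) -> str:
    """Lowercase and drop surrounding whitespace and a trailing period."""
    return answer.strip().rstrip(".").strip().lower()


def accuracy_by_dimension(records):
    """Return {dimension: exact-match accuracy} for dimensions present in records.

    Each record is a dict with hypothetical keys 'dimension', 'answer',
    and 'prediction' (an assumed layout, not the benchmark's real schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        dim = rec["dimension"]
        if dim not in DIMENSIONS:
            raise ValueError(f"unknown task dimension: {dim!r}")
        total[dim] += 1
        if normalize(rec["prediction"]) == normalize(rec["answer"]):
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy records, invented for illustration only.
sample = [
    {"dimension": "spatial perception", "answer": "B", "prediction": "b."},
    {"dimension": "spatial perception", "answer": "A", "prediction": "C"},
    {"dimension": "scene reasoning", "answer": "wheat", "prediction": "Wheat"},
]
scores = accuracy_by_dimension(sample)
```

Reporting accuracy per dimension rather than as a single number is what makes gaps such as weaker spatial reasoning visible; a pooled average would hide them.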
