Evaluating Tool-Augmented Agents in Remote Sensing Platforms
This work addresses the need for more realistic evaluation of agents in remote sensing applications. Its contribution is incremental in that it provides a benchmark rather than a new method.
The paper addresses the gap between existing benchmarks and realistic user-grounded tasks in the evaluation of tool-augmented LLMs for remote sensing. It introduces the GeoLLM-QA benchmark and reports insights from evaluating state-of-the-art LLMs on 1,000 tasks.
Tool-augmented Large Language Models (LLMs) have shown impressive capabilities in remote sensing (RS) applications. However, existing benchmarks assume question-answering input templates over predefined image-text data pairs. These standalone instructions neglect the intricacies of realistic user-grounded tasks. Consider a geospatial analyst: they zoom into a map area, draw a region over which to collect satellite imagery, and succinctly ask "Detect all objects here". Where is "here", if it is not explicitly hardcoded in the image-text template but is instead implied by the system state, e.g., the live map positioning? To bridge this gap, we present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform. Through in-depth evaluation of state-of-the-art LLMs over a diverse set of 1,000 tasks, we offer insights towards stronger agents for RS applications.
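To make the "here" problem concrete, the following is a minimal sketch of how an agent might ground an implicit spatial reference in system state. It is not the GeoLLM-QA implementation: `MapState`, `resolve_here`, `build_tool_call`, the `detect_objects` tool name, and the bounding-box convention are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed convention for this sketch: (min_lon, min_lat, max_lon, max_lat).
BBox = Tuple[float, float, float, float]


@dataclass
class MapState:
    """Hypothetical snapshot of the UI: live viewport plus an optional drawn region."""
    viewport: BBox
    drawn_region: Optional[BBox] = None


def resolve_here(state: MapState) -> BBox:
    """Ground 'here' in system state: prefer the drawn region, else the viewport."""
    if state.drawn_region is not None:
        return state.drawn_region
    return state.viewport


def build_tool_call(instruction: str, state: MapState) -> dict:
    """Turn a terse instruction into an explicit call to a hypothetical tool."""
    words = re.findall(r"[a-z]+", instruction.lower())
    if "here" not in words:
        raise ValueError("instruction does not refer to the current map state")
    # The region is taken from the UI state, not from the instruction text.
    return {"tool": "detect_objects", "bbox": resolve_here(state)}


state = MapState(
    viewport=(-10.0, 40.0, -9.0, 41.0),
    drawn_region=(-9.6, 40.2, -9.4, 40.4),
)
call = build_tool_call("Detect all objects here", state)
```

The same instruction yields a different tool call whenever the map state changes, which is the dependency that a fixed image-text template cannot express.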