GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots
This work addresses the need for realistic benchmarks for geospatial copilots in Earth Observation applications, moving beyond simplified single-task evaluations.
The paper tackles the problem of evaluating geospatial AI agents in realistic multi-tool scenarios by introducing GeoLLM-Engine, an environment that scales to over 500,000 diverse tasks and 1.1 million satellite images. The environment makes it possible to assess both how well agents interpret natural language commands and whether they complete tasks correctly.
Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents featuring intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in completing tasks. By alleviating the overheads typically associated with human-in-the-loop benchmark curation, we run our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks and 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
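To make the notion of functional correctness in a multi-tool task more concrete, the following is a minimal sketch of how an agent's tool-call trace might be compared against a reference trace. It is an illustration under stated assumptions, not the metric or API used by GeoLLM-Engine: the names `ToolCall`, `make_call`, and `score_episode`, the tool names, and the state dictionaries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation (hypothetical representation)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so calls compare by value


def make_call(name, **kwargs):
    # Sort keyword arguments so argument order does not affect equality.
    return ToolCall(name, tuple(sorted(kwargs.items())))


def score_episode(reference, predicted, reference_state, final_state):
    """Score one task under two assumed criteria.

    tool_correctness: fraction of reference calls matched position by position.
    success: whether the final environment state equals the reference state.
    """
    matched = sum(1 for r, p in zip(reference, predicted) if r == p)
    tool_correctness = matched / len(reference) if reference else 1.0
    success = final_state == reference_state
    return {"tool_correctness": tool_correctness, "success": success}


# Hypothetical three-step task: load imagery, detect objects, plot the result.
reference = [
    make_call("load_images", region="Tokyo"),
    make_call("detect_objects", category="airplane"),
    make_call("plot_on_map", layer="detections"),
]
predicted = [
    make_call("load_images", region="Tokyo"),
    make_call("detect_objects", category="airplane"),
    make_call("plot_on_map", layer="images"),  # wrong layer in the last step
]

result = score_episode(
    reference,
    predicted,
    reference_state={"map_layer": "detections"},
    final_state={"map_layer": "images"},
)
print(result)
```

Separating a per-call score from an end-state check reflects the distinction drawn above between interpreting a command and completing the task correctly: an agent can issue mostly correct calls and still leave the environment in the wrong final state.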