CY, AI, IR · May 30, 2025

The World As Large Language Models See It: Exploring the reliability of LLMs in representing geographical features

Omid Reza Abbasi, Franz Welscher, Georg Weinberger, Johannes Scholz

arXiv:2506.00203v1

Originality: Synthesis-oriented

AI Analysis

This work addresses the factual trustworthiness of LLMs in geographic applications, a concern particularly relevant to GIScience and Geoinformatics users. Its contribution is incremental, however, because it applies existing evaluation methods to new models.

This study evaluated how reliably GPT-4o and Gemini 2.0 Flash represent geographical features in Austria through three tasks: geocoding, elevation estimation, and reverse geocoding. Both models showed systematic errors and inconsistencies. Gemini generally outperformed GPT-4o, but neither model produced an accurate reconstruction of Austria's federal states.

As large language models (LLMs) continue to evolve, questions about their trustworthiness in delivering factual information have become increasingly important. This concern also applies to their ability to accurately represent the geographic world. With recent advancements in this field, it is relevant to consider whether and to what extent LLMs' representations of the geographical world can be trusted. This study evaluates the performance of GPT-4o and Gemini 2.0 Flash in three key geospatial tasks: geocoding, elevation estimation, and reverse geocoding. In the geocoding task, both models exhibited systematic and random errors in estimating the coordinates of St. Anne's Column in Innsbruck, Austria, with GPT-4o showing greater deviations and Gemini 2.0 Flash demonstrating more precision but a significant systematic offset. For elevation estimation, both models tended to underestimate elevations across Austria, though they captured overall topographical trends, and Gemini 2.0 Flash performed better in eastern regions. The reverse geocoding task, which involved identifying Austrian federal states from coordinates, revealed that Gemini 2.0 Flash outperformed GPT-4o in overall accuracy and F1-scores and was more consistent across regions. Even so, neither model achieved an accurate reconstruction of Austria's federal states, and misclassifications persisted. The study concludes that while LLMs can approximate geographic information, their accuracy and reliability are inconsistent, underscoring the need for fine-tuning with geographical information to enhance their utility in GIScience and Geoinformatics.
