AI · CL · May 31, 2025

CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing

arXiv:2506.00530v1 · 5 citations · h-index: 21 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for standardized evaluation of LLVMs in urban socioeconomic sensing. The contribution is incremental: it provides a new benchmark rather than a novel method.

The authors evaluate how well large language-vision models (LLVMs) predict urban socioeconomic indicators from visual data. They introduce CityLens, a benchmark with a multi-modal dataset covering 17 cities and 11 tasks, and find that LLVMs show promise but remain limited in this domain.

Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LLVMs across these tasks. Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LLVMs to understand and predict urban socioeconomic patterns. Our code and datasets are open-sourced via https://github.com/tsinghua-fib-lab/CityLens.
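The abstract names the evaluation paradigms but not how predictions are scored. As a rough illustration only, the sketch below shows one way scores for Direct Metric Prediction and Normalized Metric Estimation might be computed. Everything in it is an assumption rather than the CityLens implementation: the min-max mapping onto a 0 to 10 scale, the choice of R² as the metric, the function names, and the made-up numbers. Feature-Based Regression is not covered here.

```python
from statistics import mean


def min_max_normalize(values, scale=10.0):
    """Map raw indicator values onto [0, scale] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant indicator carries no ranking information.
        return [0.0 for _ in values]
    return [scale * (v - lo) / (hi - lo) for v in values]


def r_squared(y_true, y_pred):
    """Coefficient of determination between ground truth and predictions."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for constant ground truth")
    return 1.0 - ss_res / ss_tot


# Hypothetical ground truth for one indicator across four regions.
raw_truth = [12000.0, 30000.0, 21000.0, 48000.0]

# Direct prediction: the model is asked for the raw value itself.
direct_pred = [15000.0, 28000.0, 20000.0, 40000.0]
direct_r2 = r_squared(raw_truth, direct_pred)

# Normalized estimation: the model is asked for a score on a fixed scale,
# which is compared against the normalized ground truth.
norm_truth = min_max_normalize(raw_truth)
norm_pred = [1.0, 4.5, 3.0, 9.0]
normalized_r2 = r_squared(norm_truth, norm_pred)

print(f"direct R^2: {direct_r2:.3f}")
print(f"normalized R^2: {normalized_r2:.3f}")
```

Scoring both variants with the same metric keeps them comparable. The normalized variant removes the need for the model to know the indicator's units or magnitude.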

