CV
Aug 29, 2025

How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

arXiv:2508.21565v1
1 citation
h-index: 5
2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

The paper addresses the problem of adapting general-purpose VLMs to specialized urban domains, for applications such as autonomous driving or urban planning. The advance is incremental because it builds on existing models and methods.

The study evaluated how well vision-language models (VLMs) handle spatial reasoning in urban scenes from street-view images, finding that fine-tuning with a synthetic dataset improved performance, especially on challenging question types like negation and counterfactuals.

Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
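
To make the dataset-construction idea concrete, the following is a minimal sketch of how question-answer pairs might be derived from perception outputs. It is not the authors' pipeline: the `DetectedObject` record, its fields, the question templates, and the "larger depth means farther" convention are all assumptions made for illustration, and the LLM-generated CoT answers described in the abstract are not reproduced here (only short yes/no answers are produced).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    """Hypothetical record combining detector and depth-model outputs."""
    label: str           # class name from an object detector
    x_center: float      # horizontal center of the bounding box, in pixels
    median_depth: float  # median predicted depth in the box (assumed: larger = farther)


def closer_question(a: DetectedObject, b: DetectedObject) -> Tuple[str, str]:
    """Depth-comparison question answered from the predicted depths."""
    question = f"Is the {a.label} closer to the camera than the {b.label}?"
    answer = "yes" if a.median_depth < b.median_depth else "no"
    return question, answer


def left_of_question(a: DetectedObject, b: DetectedObject) -> Tuple[str, str]:
    """Layout question answered from horizontal box centers."""
    question = f"Is the {a.label} to the left of the {b.label}?"
    answer = "yes" if a.x_center < b.x_center else "no"
    return question, answer


def negation_question(objects: List[DetectedObject], label: str) -> Tuple[str, str]:
    """Negation-style question: asks whether a class is absent from the scene."""
    present = any(o.label == label for o in objects)
    question = f"Is there no {label} in the image?"
    return question, "no" if present else "yes"


if __name__ == "__main__":
    # Illustrative values only; real inputs would come from model predictions.
    scene = [
        DetectedObject("car", x_center=100.0, median_depth=8.0),
        DetectedObject("tree", x_center=400.0, median_depth=20.0),
    ]
    print(closer_question(scene[0], scene[1]))
    print(left_of_question(scene[1], scene[0]))
    print(negation_question(scene, "bus"))
```

Template-based generation of this kind keeps the answers verifiable against the perception outputs, which is one plausible reason to build the synthetic set from predictions rather than free-form captions.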

