CitySeeker: How Do VLMs Explore Embodied Urban Navigation With Implicit Human Needs?
This work addresses the challenge of "last-mile" navigation for users in dynamic urban environments. The contribution is incremental: it builds on existing VLM navigation benchmarks and shifts the focus to implicit needs.
The paper tackles the problem of Vision-Language Models (VLMs) interpreting implicit human needs (e.g., 'I am thirsty') for embodied urban navigation, introducing the CitySeeker benchmark with 6,440 trajectories across 8 cities, and finds that top-performing models achieve only 21.1% task completion.
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities in embodied urban navigation driven by implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify three key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze these further, we investigate a series of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required for tackling "last-mile" navigation challenges.
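
The abstract names backtracking as one exploratory strategy but does not say how it is implemented. The sketch below is only a toy illustration of the general idea, not the paper's method. It assumes a hypothetical street graph given as an adjacency dictionary, a hypothetical `score` function standing in for the VLM's preference among candidate moves, and a hypothetical `is_goal` check for whether a location satisfies the implicit need.

```python
from typing import Callable, Dict, List, Optional


def navigate_with_backtracking(
    graph: Dict[str, List[str]],
    start: str,
    is_goal: Callable[[str], bool],
    score: Callable[[str], float],
    max_steps: int = 20,
) -> Optional[List[str]]:
    """Greedy walk over a graph that steps back when it hits a dead end.

    All names here are hypothetical: `graph` maps a node to its neighbours,
    `score` stands in for a model's preference for a candidate node, and
    `is_goal` says whether a node satisfies the need.
    """
    path = [start]
    visited = {start}
    steps = 0
    while path and steps < max_steps:
        current = path[-1]
        if is_goal(current):
            return path
        # Only consider neighbours that have not been tried yet.
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if candidates:
            best = max(candidates, key=score)
            visited.add(best)
            path.append(best)
        else:
            # Dead end: drop the current node and return to the previous one.
            path.pop()
        steps += 1
    # The goal may have been reached on the very last permitted step.
    if path and is_goal(path[-1]):
        return path
    return None


# Toy example: the scorer prefers "B", which leads to a dead end at "D".
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
toy_scores = {"B": 0.9, "C": 0.5, "D": 0.1, "E": 0.8}
route = navigate_with_backtracking(
    toy_graph, "A", lambda n: n == "E", lambda n: toy_scores[n]
)
print(route)  # ['A', 'C', 'E']
```

In this toy run the walker first follows the higher-scored but wrong branch, backs out of it, and then reaches the goal. Each step, including each step back, counts against the step budget, so an early mistake makes it harder to finish within a fixed horizon.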