UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction
This work addresses urban planning and sustainable development challenges by improving prediction accuracy for socioeconomic indicators, though it appears incremental by building on prior models like UrbanCLIP.
The paper tackles urban socioeconomic indicator prediction by addressing biases from macro-level satellite data and unreliable text generation, proposing UrbanVLP, a vision-language pretraining framework that integrates multi-granularity imagery and automated text calibration, achieving superior performance across six tasks.
Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced details at micro levels, such as architectural details at a place. Secondly, the text generated by the precursor work UrbanCLIP, which fully utilizes the extensive knowledge of LLMs, frequently exhibits issues such as hallucination and homogenization, resulting in a lack of reliable quality. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, providing a robust guarantee for producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six socioeconomic indicator prediction tasks underscore its superior performance.