Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views
This work addresses location privacy risks in social media posts by providing a domain-specific benchmark for evaluating VLMs on Korean street views. The contribution is incremental, since it builds on existing geolocation evaluation methods.
The authors evaluate vision-language models (VLMs) on geolocation of Korean street views. They present KoreaGEO Bench, a fine-grained benchmark of 1,080 images with multi-contextual annotations, and report modality-driven shifts in localization precision as well as structural prediction biases toward core cities.
Recent advances in vision-language models (VLMs) have enabled accurate image-based geolocation, raising serious concerns about location privacy risks in everyday social media posts. However, current benchmarks remain coarse-grained and linguistically biased, and they lack multimodal and privacy-aware evaluations. To address these gaps, we present KoreaGEO Bench, the first fine-grained, multimodal geolocation benchmark for Korean street views. Our dataset comprises 1,080 high-resolution images sampled across four urban clusters and nine place types, enriched with multi-contextual annotations and two styles of Korean captions simulating real-world privacy exposure. We introduce a three-path evaluation protocol to assess ten mainstream VLMs under varying input modalities and analyze their accuracy, spatial bias, and reasoning behavior. Results reveal modality-driven shifts in localization precision and highlight structural prediction biases toward core cities.
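The abstract does not spell out how accuracy or spatial bias are computed, so the following is only a minimal sketch of one common way such quantities are measured in geolocation work: great-circle (haversine) error with accuracy at distance thresholds, and the gap between how often a city is predicted and how often it is the true answer. The function names, the threshold values, and the city-label representation are all assumptions made for illustration, not the benchmark's actual protocol.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200)):
    """Fraction of predictions within each distance threshold.

    preds and truths are equal-length lists of (lat, lon) tuples.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and of equal length")
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {t: sum(d <= t for d in dists) / len(dists) for t in thresholds_km}


def prediction_share_gap(pred_labels, true_labels):
    """Per-city share of predictions minus share of ground truth.

    A positive gap means the city is predicted more often than it occurs,
    which is one simple (assumed) way to quantify bias toward core cities.
    """
    if len(pred_labels) != len(true_labels) or not pred_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(pred_labels)
    pred_counts, true_counts = Counter(pred_labels), Counter(true_labels)
    cities = set(pred_counts) | set(true_counts)
    return {c: pred_counts[c] / n - true_counts[c] / n for c in cities}
```

Under these assumptions, a model that answers "Seoul" for most images regardless of their true location would show a large positive gap for Seoul and negative gaps for the other cities, even if its threshold accuracy on Seoul images looks strong.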