G^3: Geolocation via Guidebook Grounding
This work targets geolocation for applications such as interactive games and location-based services. Its contribution is incremental: it extends existing image-based methods by incorporating textual guidebooks.
The paper tackles geolocation by predicting the country where an image was taken, using explicit knowledge from human-written guidebooks to improve accuracy. It reports an improvement of over 5% in Top-1 accuracy compared with a state-of-the-art image-only method.
We demonstrate how language can improve geolocation: the task of predicting the location where an image was taken. Here we study explicit knowledge from human-written guidebooks that describe the salient and class-discriminative visual features humans use for geolocation. We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locations and an associated textual guidebook for GeoGuessr, a popular interactive geolocation game. Our approach predicts a country for each image by attending over the clues automatically extracted from the guidebook. Supervising attention with country-level pseudo labels achieves the best performance. Our approach substantially outperforms a state-of-the-art image-only geolocation method, with an improvement of over 5% in Top-1 accuracy. Our dataset and code can be found at https://github.com/g-luo/geolocation_via_guidebook_grounding.
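The idea of attending over guidebook clues and supervising that attention with country-level pseudo labels can be illustrated with a small sketch. This is not the authors' implementation. The dot-product scoring, the cross-entropy form of the attention loss, the toy vectors, and all function names (`attend_over_clues`, `pseudo_labels`, `attention_supervision_loss`) are assumptions made for illustration. The sketch assumes a clue counts as relevant to an image when its text mentions the image's ground-truth country.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend_over_clues(image_vec, clue_vecs):
    """Return (attention weights, attention-weighted clue vector).

    Assumption: attention scores are plain dot products between the
    image embedding and each clue embedding.
    """
    weights = softmax([dot(image_vec, c) for c in clue_vecs])
    dim = len(clue_vecs[0])
    pooled = [sum(w * c[d] for w, c in zip(weights, clue_vecs))
              for d in range(dim)]
    return weights, pooled

def pseudo_labels(clue_countries, true_country):
    """1.0 for clues that mention the image's country, else 0.0."""
    return [1.0 if true_country in countries else 0.0
            for countries in clue_countries]

def attention_supervision_loss(weights, labels, eps=1e-9):
    """Cross-entropy between attention and normalized pseudo labels.

    Assumption: this loss form is illustrative only. It returns 0.0
    when no clue mentions the country, so such images add no penalty.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    return -sum((y / n_pos) * math.log(w + eps)
                for w, y in zip(weights, labels))

# Toy data (hypothetical): one image embedding and three clue embeddings,
# each clue tagged with the set of countries its text mentions.
image_vec = [1.0, 0.0]
clue_vecs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
clue_countries = [{"Japan"}, {"Brazil"}, {"Japan", "Kenya"}]

weights, pooled = attend_over_clues(image_vec, clue_vecs)
labels = pseudo_labels(clue_countries, "Japan")
loss = attention_supervision_loss(weights, labels)
```

In a full model, the pooled clue vector would be combined with the image embedding and passed to a country classifier, and the attention loss would be added to the classification loss during training. Those parts are omitted here.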