A Comparative Analysis of Content-based Geolocation in Blogs and Tweets
This work addresses content-based geolocation for social media analysis, contributing new location-specific features and a comparison of geolocation performance across media.
The paper tackles geolocation of online content by comparing text-based methods on Blogger and Twitter data. It introduces novel location-specific features that reduce error rates by up to 12.5% compared with the best previously proposed features, and finds that Blogger users are harder to geolocate despite their longer posts.
The geolocation of online information is an essential component in any geospatial application. While most previous work on geolocation has focused on Twitter, in this paper we quantify and compare the performance of text-based geolocation methods on social media data drawn from both Blogger and Twitter. We introduce a novel set of location-specific features that are both highly informative and easily interpretable, and show that we can achieve error rate reductions of up to 12.5% with respect to the best previously proposed geolocation features. We also show that, despite posting longer text, Blogger users are significantly harder to geolocate than Twitter users. Additionally, we investigate the effect of training and testing on different media (cross-media predictions) and of combining multiple social media sources (multi-media predictions). Finally, we explore the geolocability of social media in relation to three user dimensions: state, gender, and industry.
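The difference between the same-medium, cross-media, and multi-media settings can be illustrated with a minimal sketch. It assumes each medium has already been split into training and test sets of (text, location) pairs. The function name `build_settings`, the setting labels, and the data layout are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# A labeled example: (text, location label). This layout is an assumption.
Example = Tuple[str, str]
Setting = Tuple[List[Example], List[Example]]


def build_settings(
    blog_train: List[Example],
    blog_test: List[Example],
    tweet_train: List[Example],
    tweet_test: List[Example],
) -> Dict[str, Setting]:
    """Return a (train, test) pair for each evaluation setting."""
    combined_train = blog_train + tweet_train
    return {
        # Same-medium baselines: train and test on the same source.
        "blog->blog": (blog_train, blog_test),
        "tweet->tweet": (tweet_train, tweet_test),
        # Cross-media: train on one medium, test on the other.
        "blog->tweet": (blog_train, tweet_test),
        "tweet->blog": (tweet_train, blog_test),
        # Multi-media: train on both media combined, test on each.
        "both->blog": (combined_train, blog_test),
        "both->tweet": (combined_train, tweet_test),
    }
```

Any text classifier could then be trained on the first element of each pair and scored on the second, which keeps the test sets identical across settings so that the resulting error rates are comparable.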