A Large-Scale Empirical Study of Geotagging Behavior on Twitter
This work addresses the problem of validating geotagging assumptions for researchers using social media data, but it is incremental as it builds on prior studies without introducing new methods.
The study analyzed over 40 billion tweets from 20 million users to challenge assumptions about geotagging behavior on Twitter, finding that geotagging rates vary by language group (e.g., less than 3% for Korean vs. over 40% for Indonesian users), location reporting in profiles correlates with geotagging, and homophily influences preferences.
Geotagging on social media has become an important proxy for understanding people's mobility and social events. Research that uses geotags to infer public opinion relies on several key assumptions about the behavior of geotagged and non-geotagged users. However, these assumptions have not been fully validated, and this limited understanding of geotagging behavior constrains how reliably geotags can be used. In this paper, we present an empirical study of geotagging behavior on Twitter based on more than 40 billion tweets collected from 20 million users. Three main findings may challenge these common assumptions. First, different groups of users have different geotagging preferences. For example, fewer than 3% of users who tweet in Korean use geotags, while more than 40% of users who tweet in Indonesian do. Second, users who report their locations in their profiles are more likely to use geotags, which may affect the generalizability of location prediction systems to non-geotagged users. Third, there is a strong homophily effect in users' geotagging behavior: users tend to connect to friends with similar geotagging preferences.
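As an illustration of the two quantities the abstract refers to, the following minimal Python sketch computes a per-language geotagging rate and a simple edge-level homophily indicator. It is not the authors' pipeline: the function names, the input layout (one `(language, has_geotag)` pair per user, and friendship edges as user-id pairs), and the toy data are all assumptions made for this example. The homophily indicator compares the observed fraction of same-preference edges with the fraction expected if users connected at random, which is only one of several ways such an effect could be measured.

```python
from collections import defaultdict


def geotag_rate_by_group(users):
    """Return the fraction of users with at least one geotag, per language.

    `users` is an iterable of (language, has_geotag) pairs, one per user.
    This input layout is an assumption for the sketch.
    """
    totals = defaultdict(int)
    tagged = defaultdict(int)
    for language, has_geotag in users:
        totals[language] += 1
        if has_geotag:
            tagged[language] += 1
    return {lang: tagged[lang] / totals[lang] for lang in totals}


def same_preference_edge_fraction(edges, is_geotagger):
    """Fraction of friendship edges whose endpoints share a geotagging preference."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if is_geotagger[u] == is_geotagger[v])
    return same / len(edges)


def expected_same_fraction(is_geotagger):
    """Baseline: chance that two randomly chosen users share a preference."""
    p = sum(is_geotagger.values()) / len(is_geotagger)
    return p * p + (1 - p) * (1 - p)


# Hypothetical toy data, not taken from the study.
users = [
    ("ko", False), ("ko", False), ("ko", False), ("ko", True),
    ("id", True), ("id", True), ("id", False), ("id", False),
]
rates = geotag_rate_by_group(users)

is_geotagger = {"a": True, "b": True, "c": False, "d": False}
edges = [("a", "b"), ("c", "d"), ("a", "c")]
observed = same_preference_edge_fraction(edges, is_geotagger)
baseline = expected_same_fraction(is_geotagger)

print(rates)
print(observed, baseline)
```

An observed fraction clearly above the baseline would be consistent with homophily; on real data, a significance test against a randomized network would be needed before drawing that conclusion.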