Mapping Languages and Demographics with Georeferenced Corpora
This work addresses how accurately online data can be used to map languages and demographics, a question relevant to researchers and policymakers. Its contribution is incremental: it compares existing datasets rather than introducing new methods.
The paper evaluates georeferenced corpora from web-crawled and social media sources against ground-truth data to determine which better represents population demographics and language use. It finds that social media data correlates more strongly with actual populations (r=0.60, versus r=0.49 for web-crawled data) and predicts each country's language inventory more accurately.
This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii) how to weight the datasets to provide more accurate representations of underlying populations. The paper finds that the two datasets represent very different populations and that they correlate with actual populations with values of r=0.60 (social media) and r=0.49 (web-crawled). Further, Twitter data makes better predictions about the inventory of languages used in each country.
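The comparison described above rests on two simple quantities: a correlation between per-country corpus size and per-country population, and a per-country weight that corrects for over- or under-representation. The sketch below illustrates both under stated assumptions. It assumes the reported r values are Pearson correlations computed over countries, which the abstract does not state explicitly, and it uses a ratio of population share to corpus share as one plausible weighting, not necessarily the paper's scheme. The function names, country labels, and numbers are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def representation_weights(corpus_counts, populations):
    """Weight per country = population share / corpus share.

    A weight above 1 means the country is under-represented in the corpus.
    This is an assumed weighting scheme, not necessarily the paper's.
    Countries with no corpus data get None, since no weight can recover them.
    """
    total_corpus = sum(corpus_counts.values())
    total_pop = sum(populations.values())
    weights = {}
    for country, pop in populations.items():
        corpus_share = corpus_counts.get(country, 0) / total_corpus
        pop_share = pop / total_pop
        weights[country] = pop_share / corpus_share if corpus_share > 0 else None
    return weights


if __name__ == "__main__":
    # Hypothetical toy values, not taken from the paper.
    populations = {"A": 50, "B": 30, "C": 20}
    corpus_counts = {"A": 20, "B": 60, "C": 20}
    countries = sorted(populations)
    r = pearson_r(
        [corpus_counts[c] for c in countries],
        [populations[c] for c in countries],
    )
    print("r =", r)
    print(representation_weights(corpus_counts, populations))
```

Running the same calculation once per corpus (web-crawled and social media) against the same ground-truth population figures would yield the kind of side-by-side comparison the abstract reports.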