Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas
This is an incremental study for researchers in computational linguistics and social sciences, exploring the use of social media data as a proxy for real-world linguistic diversity.
The study compared measures of linguistic diversity derived from social media language data and from census data for subnational areas of New Zealand. It found potential for using social media to track spatial and temporal changes in diversity, though further work is needed to assess how well social media represents real-world populations.
This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces (i.e., subnational administrative areas) in Aotearoa New Zealand. We compare measures of linguistic diversity between these spaces and discuss how social media users align with real-world populations. The results of the current study suggest that there is potential to use online social media language data to observe spatial and temporal changes in linguistic diversity across subnational geographic areas; however, further work is required to understand how well social media represents real-world behaviour.
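The abstract does not state which diversity measures were used, so the following is only a minimal sketch of how such a comparison might be set up for a single area. It assumes two common measures, Shannon entropy and Greenberg's diversity index (one minus the sum of squared language shares). The function names, the language labels, and the counts are all hypothetical placeholders and are not taken from the study.

```python
import math


def _shares(counts):
    """Convert raw per-language counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    return [n / total for n in counts.values() if n > 0]


def shannon_entropy(counts):
    """Shannon entropy (natural log) of the language distribution."""
    return -sum(p * math.log(p) for p in _shares(counts))


def greenberg_index(counts):
    """Probability that two randomly drawn observations differ in language."""
    return 1.0 - sum(p * p for p in _shares(counts))


# Toy, made-up counts for one hypothetical area (NOT data from the study):
# census respondents per language vs. social media posts per language.
census_counts = {"lang_a": 900, "lang_b": 60, "lang_c": 40}
social_counts = {"lang_a": 700, "lang_b": 200, "lang_c": 100}

for name, counts in [("census", census_counts), ("social", social_counts)]:
    print(name, round(shannon_entropy(counts), 3), round(greenberg_index(counts), 3))
```

Repeating the calculation per area and per time period would give two series of index values, one per data source, which could then be compared, for example by rank correlation across areas. Whether that matches the comparison made in the paper is an assumption.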