CL | Feb 28, 2025

Detecting Linguistic Diversity on Social Media

arXiv:2502.21224v1 | h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more dynamic language data sources for researchers and policymakers, but it is incremental as it applies existing methods to a new context.

The study tackled the problem of measuring linguistic diversity by using social media data as an alternative to census data in Aotearoa New Zealand, showing that social media can provide spatial and temporal insights into linguistic profiles and is sensitive to demographic changes at regional and local levels.

This chapter explores the efficacy of using social media data to examine the changing linguistic behaviour of a place. We focus our investigation on Aotearoa New Zealand, where official census statistics are the only source of language use data. We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We use place as the common denominator between the two data sources. We identify the language conditions of each tweet in the social media data set and validate our results with two language identification models. We then compare levels of linguistic diversity at national, regional, and local geographies. The results suggest that social media language data can provide a rich source of spatial and temporal insights into the linguistic profile of a place. We show that social media is sensitive to demographic and sociopolitical changes within a language and at low-level regional and local geographies.
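As a rough illustration of the comparison step described above, the sketch below computes a per-place diversity score from language-labelled posts. The abstract does not say which diversity measure the authors use, so Shannon entropy and Greenberg's diversity index are assumptions here, and the function names, place names, and sample records are all hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def language_shares(labels):
    """Return each language's share of the labelled posts."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def shannon_diversity(labels):
    """Shannon entropy (natural log) of the language distribution."""
    return -sum(p * log(p) for p in language_shares(labels).values())


def greenberg_index(labels):
    """Probability that two randomly drawn posts are in different languages."""
    return 1.0 - sum(p * p for p in language_shares(labels).values())


def diversity_by_place(records):
    """Group (place, language_code) pairs by place and score each place."""
    by_place = defaultdict(list)
    for place, lang in records:
        by_place[place].append(lang)
    return {
        place: {
            "shannon": shannon_diversity(langs),
            "greenberg": greenberg_index(langs),
        }
        for place, langs in by_place.items()
    }


# Made-up records for illustration only; not data from the paper.
sample = [
    ("Auckland", "en"), ("Auckland", "en"),
    ("Auckland", "mi"), ("Auckland", "zh"),
    ("Otago", "en"), ("Otago", "en"),
]
result = diversity_by_place(sample)
```

The same function can be run on census language counts expanded to one label per respondent, so that both sources are scored on the same scale at each geography before they are compared.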
