SOC-PH CL SI ML · Jul 26, 2014

Crowdsourcing Dialect Characterization through Twitter

arXiv:1407.7094v1 · 102 citations

Originality: Incremental advance

AI Analysis

This paper provides a novel, large-scale method for dialectology based on social media data, offering linguists and sociologists insight into language variation.

The authors characterize Spanish dialect variation by analyzing geotagged Twitter messages collected over more than two years. They find that Spanish splits into two superdialects: an urban speech used across major American and Spanish cities, and a diverse form found in rural areas and small towns, which further clusters into smaller regional varieties.

We perform a large-scale analysis of diatopic language variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.
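
As a rough illustration of the idea in the abstract (grouping geographic areas by which word they use for the same concept), the sketch below builds a per-area profile of variant frequencies and groups areas whose profiles are similar. Everything in it is an assumption made for illustration: the cell names, the toy observations, the cosine similarity, and the threshold-based single-linkage grouping are not taken from the paper, and the paper's actual clustering procedure may differ.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy observations: (geographic cell, concept, variant seen in a tweet).
# These are made-up values for illustration only, not data from the paper.
OBSERVATIONS = [
    ("cell_A", "car", "coche"), ("cell_A", "car", "coche"), ("cell_A", "car", "auto"),
    ("cell_B", "car", "coche"), ("cell_B", "car", "coche"),
    ("cell_C", "car", "carro"), ("cell_C", "car", "carro"), ("cell_C", "car", "auto"),
    ("cell_D", "car", "carro"), ("cell_D", "car", "carro"),
]


def variant_profiles(observations):
    """Map each cell to the relative frequency of every (concept, variant) pair.

    Frequencies are normalized per concept, so each concept's variants sum to 1
    within a cell.
    """
    counts = {}
    totals = Counter()
    for cell, concept, variant in observations:
        counts.setdefault(cell, Counter())[(concept, variant)] += 1
        totals[(cell, concept)] += 1
    return {
        cell: {key: n / totals[(cell, key[0])] for key, n in per_cell.items()}
        for cell, per_cell in counts.items()
    }


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def cluster(profiles, threshold=0.5):
    """Group cells by single linkage: any pair at or above the threshold is merged."""
    cells = sorted(profiles)
    parent = {cell: cell for cell in cells}

    def find(cell):
        # Follow parent links up to the representative of the group.
        while parent[cell] != cell:
            cell = parent[cell]
        return cell

    for i, a in enumerate(cells):
        for b in cells[i + 1:]:
            if cosine(profiles[a], profiles[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for cell in cells:
        groups.setdefault(find(cell), []).append(cell)
    return sorted(groups.values())


if __name__ == "__main__":
    print(cluster(variant_profiles(OBSERVATIONS)))
```

With these toy observations, the cells that mostly use "coche" end up in one group and the cells that mostly use "carro" in another. A real analysis would need many concepts, far more observations per cell, and a clustering method chosen and validated for the data.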
