Representations of Language Varieties Are Reliable Given Corpus Similarity Measures
This work addresses the need for reliable data sources in computational linguistics, particularly for researchers studying language variation. Its contribution is incremental, since it builds on existing corpus similarity methods.
The paper evaluates whether digital geo-referenced corpora are reliable for modeling linguistic variation. It measures similarity across 84 language varieties drawn from web and tweet sources and finds consistent agreement between the sources using frequency-based measures, which provides evidence that these corpora reliably represent local language varieties.
This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modeling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
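To make the notion of a frequency-based corpus similarity measure concrete, the following is a minimal sketch in Python. It assumes one common formulation, Spearman rank correlation over the frequencies of the most common words in two corpora; the text above does not specify which measure or which features the paper uses, so the function names (`corpus_similarity`, `rank`, `pearson`), the whitespace tokenization, and the `top_n` parameter are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def corpus_similarity(corpus_a, corpus_b, top_n=1000):
    """Spearman correlation of word frequencies (assumed measure).

    Features are the top_n most frequent words in the two corpora
    combined; tokenization is naive lowercased whitespace splitting.
    """
    freq_a = Counter(corpus_a.lower().split())
    freq_b = Counter(corpus_b.lower().split())
    features = [w for w, _ in (freq_a + freq_b).most_common(top_n)]
    va = [freq_a[w] for w in features]
    vb = [freq_b[w] for w in features]
    # Spearman's rho is the Pearson correlation of the rank vectors.
    return pearson(rank(va), rank(vb))


if __name__ == "__main__":
    web = "the cat sat on the mat the cat"
    tweets = "the cat sat on the mat the cat"
    other = "dog dog dog runs fast"
    print(corpus_similarity(web, tweets))
    print(corpus_similarity(web, other))
```

Under this sketch, the reliability argument amounts to checking that the score between the web and tweet corpora of the same variety stays high and comparable across languages and countries, relative to scores between different varieties.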