Exploring Diachronic and Diatopic Changes in Dialect Continua: Tasks, Datasets and Challenges
This work addresses the gap in inclusive NLP for non-standard language communities by providing a unified review of dialect studies. Its contribution is incremental: it synthesizes existing research rather than proposing new methods.
This paper systematically reviews research on both diachronic (time-based) and diatopic (geographic) changes in dialect continua, assessing nine tasks and datasets across five dialects from three language families, and outlines five open challenges for future work.
Continual contact between language communities leads to constant change in languages over time and gives rise to language varieties and dialects. However, communities that speak non-standard varieties are often overlooked by non-inclusive NLP technologies. Recently, there has been a surge of interest in studying diatopic and diachronic changes in dialect NLP, but there is currently no research exploring the intersection of the two. Our work aims to fill this gap by systematically reviewing papers on diachronic and diatopic change from a unified perspective. In this work, we critically assess nine tasks and datasets across five dialects from three language families (Slavic, Romance, and Germanic) in both spoken and written modalities. The tasks covered are diverse, including corpus construction, dialect distance estimation, and dialect geolocation prediction, among others. Moreover, we outline five open challenges regarding changes in dialect use over time, the reliability of dialect datasets, the importance of speaker characteristics, limited coverage of dialects, and ethical considerations in data collection. We hope that our work informs future research towards inclusive computational methods and datasets for language varieties and dialects.