CLAug 1, 2023

GRDD: A Dataset for Greek Dialectal NLP

arXiv:2308.00802v51 citationsh-index: 19
Originality Synthesis-oriented
AI Analysis

This provides a foundational dataset for NLP research on Greek dialects, addressing a gap in resources for computational linguistics in this domain.

The authors tackled the lack of large-scale dialectal resources for Modern Greek by creating the GRDD dataset, which includes text from four dialects, and used it for dialect identification, achieving very good performance with simple ML and DL models.

In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek, Cretan, Pontic, Northern Greek and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and presents the first attempt to create large scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect idefntification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, potentially revealing that the dialects in question have distinct enough characteristics allowing even simple ML models to perform well on the task. Error analysis is performed for the top performing algorithms showing that in a number of cases the errors are due to insufficient dataset cleaning.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes