GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
This provides a valuable resource for NLP researchers working on Greek dialects, though it is incremental as it builds on an existing dataset.
The authors tackled the problem of limited dialectal data for Greek by creating GRDD+, an extended dataset with 6,374,939 words across 10 varieties, and evaluated its impact by fine-tuning three LLMs, showing improved performance compared to frontier models.
We present an extended Greek Dialectal Dataset (GRDD+) 1that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).