CL, Sep 22, 2025

Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora

arXiv:2509.17855v1, 3 citations, h-index: 9, EMNLP
Originality: Incremental advance
AI Analysis

This work addresses dialect processing in LLMs, an understudied problem that matters for AI accessibility in linguistically diverse regions. The advance is incremental because the study is limited to a single case, Bavarian.

The study evaluates how well LLMs understand Bavarian vocabulary. The models perform best on nouns and on lexically similar word pairs, and struggle most to distinguish direct translations from inflected variants. Adding example usages as context improves translation performance but reduces the models' ability to recognize dialect variants.

Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavarian as a case study and investigate the lexical dialect understanding capability of LLMs by examining how well they recognize and translate dialectal terms across different parts-of-speech. To this end, we introduce DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only, and use it to compile a ground truth dataset consisting of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our results show that LLMs perform best on nouns and lexically similar word pairs, and struggle most in distinguishing between direct translations and inflected variants. Interestingly, providing additional context in the form of example usages improves the translation performance, but reduces their ability to recognize dialect variants. This study highlights the limitations of LLMs in dealing with orthographic dialect variation and emphasizes the need for future work on adapting LLMs to dialects.
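The evaluation described in the abstract can be pictured with a minimal Python sketch. It is not the authors' code. It assumes each model judgment is stored as a record with hypothetical fields `pos`, `gold`, and `pred`, where the labels are the three categories named above, and it uses normalized Levenshtein distance as a stand-in for "lexical similarity", which the abstract does not define. The toy records are invented for illustration and are not taken from the DiaLemma dataset.

```python
from collections import defaultdict

# The three categories named in the abstract (label strings are assumptions).
LABELS = ("translation", "inflected", "unrelated")


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical (case-insensitive)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def accuracy_by_pos(records):
    """Share of correct three-way judgments, grouped by part of speech."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["pos"]] += 1
        correct[r["pos"]] += int(r["gold"] == r["pred"])
    return {pos: correct[pos] / total[pos] for pos in total}


# Toy records, invented for illustration only (not DiaLemma data).
toy_records = [
    {"pos": "NOUN", "gold": "translation", "pred": "translation"},
    {"pos": "NOUN", "gold": "inflected", "pred": "translation"},
    {"pos": "VERB", "gold": "unrelated", "pred": "unrelated"},
]

if __name__ == "__main__":
    print(accuracy_by_pos(toy_records))
    print(similarity("Haus", "Haus"))
```

Grouping accuracy by part of speech mirrors the per-category comparison the abstract reports. Binning word pairs by `similarity` would be one way to examine the "lexically similar word pairs" finding, though the paper may use a different measure.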

