CL · Aug 20, 2025

Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

arXiv:2508.14586v1 · 1 citation · h-index: 2 · Proceedings of the Tenth Conference on Machine Translation
Originality: Synthesis-oriented
AI Analysis

This work addresses the underrepresentation of Southern Uzbek in NLP, which benefits both speakers and researchers. Its contribution is incremental, however, because it builds on an existing model (NLLB-200) and an existing benchmark (FLORES+).

The authors address the lack of machine translation resources for Southern Uzbek, a low-resource language, by creating new datasets and a fine-tuned NLLB-200 model, and by proposing a post-processing method that restores Arabic-script half-space characters to improve the handling of morphological boundaries.

Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
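To make the half-space restoration idea concrete, here is a minimal sketch. It assumes the half-space is encoded as the zero-width non-joiner (U+200C) and uses a simple lexicon lookup. That is an illustrative stand-in, not the method proposed in the paper. The function names and the Latin placeholder tokens are hypothetical.

```python
# Illustrative sketch only: a lexicon lookup, NOT the method proposed in the paper.
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, the assumed encoding of the half-space


def build_restoration_map(lexicon):
    """Map each half-space-free spelling to its spelling with half-spaces."""
    return {form.replace(ZWNJ, ""): form for form in lexicon if ZWNJ in form}


def restore_half_spaces(text, restoration_map):
    """Replace space-separated tokens whose stripped form is in the map."""
    # Splitting on a single space preserves the original spacing when re-joined.
    return " ".join(restoration_map.get(tok, tok) for tok in text.split(" "))


# Hypothetical Latin placeholders stand in for Arabic-script word forms.
lexicon = ["stem" + ZWNJ + "suffix"]
restoration_map = build_restoration_map(lexicon)
restored = restore_half_spaces("a stemsuffix b", restoration_map)
print(restored == "a stem" + ZWNJ + "suffix b")
```

A lookup like this only recovers forms already present in the lexicon, and it misses tokens with attached punctuation. It is best read as a baseline for understanding the task rather than as a practical solution.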

