CL, Jan 13, 2021

Uzbek Cyrillic-Latin-Cyrillic Machine Transliteration

arXiv:2101.05162v1, 1.28 citations

Originality: Incremental advance

AI Analysis

The paper provides a method for producing machine-transliterated texts for the low-resource Uzbek language, an incremental improvement in a domain-specific context.

The paper tackles the problem of transliterating Uzbek dictionary words between Cyrillic and Latin scripts using a data-driven approach, achieving character-level micro-averaged F1 scores of 0.9992 for Cyrillic to Latin and 0.9959 for Latin to Cyrillic on a test set.
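For readers unfamiliar with the metric, the following is a minimal sketch of a micro-averaged F1 score over per-character predictions. It assumes one predicted label (a target-script substring) per source character. The function name `micro_f1` and the example labels are illustrative and are not taken from the paper, whose exact evaluation procedure is not described here.

```python
from collections import Counter
from typing import Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct label: true positive for that class
        else:
            fp[p] += 1   # the predicted class gains a false positive
            fn[g] += 1   # the gold class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denom = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denom if denom else 0.0


# Hypothetical example: one wrong label out of five.
gold = ["sh", "a", "k", "a", "r"]
pred = ["sh", "a", "k", "o", "r"]
score = micro_f1(gold, pred)
```

When every position receives exactly one label, each error adds one false positive and one false negative, so the micro-averaged F1 coincides with plain accuracy under this assumption.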

In this paper, we introduce a data-driven approach to transliterating Uzbek dictionary words from the Cyrillic script into the Latin script, and vice versa. We heuristically align characters of words in the source script with sub-strings of the corresponding words in the target script and train a decision tree classifier that learns these alignments. On the test set, our Cyrillic-to-Latin model achieves a character-level micro-averaged F1 score of 0.9992, and our Latin-to-Cyrillic model achieves a score of 0.9959. Our contribution is a novel method of producing machine transliterated texts for the low-resource Uzbek language.
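The alignment step can be illustrated with a small sketch. It is not the authors' heuristic: it assumes a hand-written candidate table (`CANDIDATES`, an illustrative subset of lowercase letters) and uses simple backtracking to pair each Cyrillic character with a Latin substring. The helper `training_rows` is likewise hypothetical and shows how aligned pairs could be turned into context-window examples for a classifier.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical candidate table: the Latin substrings each Cyrillic letter
# may produce. Illustrative subset only, not the paper's data.
CANDIDATES: Dict[str, Tuple[str, ...]] = {
    "а": ("a",), "е": ("ye", "e"), "к": ("k",), "л": ("l",),
    "м": ("m",), "н": ("n",), "р": ("r",), "ш": ("sh",),
}

Alignment = List[Tuple[str, str]]


def align(source: str, target: str) -> Optional[Alignment]:
    """Pair every source character with a substring of target, or return None."""
    if not source:
        return [] if not target else None
    head, rest = source[0], source[1:]
    for cand in CANDIDATES.get(head, ()):
        if target.startswith(cand):
            tail = align(rest, target[len(cand):])
            if tail is not None:          # backtrack if the remainder fails
                return [(head, cand)] + tail
    return None


def training_rows(pairs: List[Tuple[str, str]]):
    """Build ((previous, current, next), label) rows from aligned word pairs."""
    rows = []
    for src, tgt in pairs:
        alignment = align(src, tgt)
        if alignment is None:
            continue                      # skip pairs the heuristic cannot align
        padded = "^" + src + "$"          # word-boundary markers
        for i, (ch, label) in enumerate(alignment):
            rows.append(((padded[i], ch, padded[i + 2]), label))
    return rows


rows = training_rows([("ел", "yel"), ("мен", "men")])
```

In these rows the letter `е` is labelled `ye` at the start of a word and `e` after a consonant, which is the kind of context-dependent choice a classifier has to learn. Rows of this shape could then be encoded and passed to a decision tree learner; that step is omitted here.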
