CL | Sep 6, 2024

Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak

Mukhammadsaid Mamasaidov, Abror Shopulatov

arXiv:2409.04269v122 citations | h-index: 2 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the limited machine translation capabilities available to Karakalpak speakers and contributes to linguistic diversity in NLP, though as a shared-task submission its contribution is incremental.

The study tackles low-resource machine translation for Karakalpak by creating datasets and open-source models; its experiments show improvements over existing baselines.

This study presents several contributions for the Karakalpak language: a FLORES+ devtest dataset translated to Karakalpak; parallel corpora for Uzbek-Karakalpak, Russian-Karakalpak, and English-Karakalpak of 100,000 pairs each; and open-sourced fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches, demonstrating improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation capabilities for Karakalpak and contribute to expanding linguistic diversity in NLP technologies.
