CL | LG | Mar 25, 2025

Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair

arXiv:2503.20007v1 | 12 citations | h-index: 6 | NAACL
AI Analysis

This paper addresses machine translation for a specific, under-resourced language pair with code-switching. The contribution is incremental, as it applies known synthetic-data methods to a new domain.

The paper tackles machine translation for the low-resource, code-switched Kazakh-Russian language pair without labeled data by generating synthetic data. Its model achieves a BLEU score of 16.48 and outperforms a commercial system in human evaluation.

Machine translation for low-resource language pairs is a challenging task. It becomes extremely difficult once a speaker uses code-switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switching Kazakh-Russian parallel corpus and the evaluation results, which include a model achieving 16.48 BLEU, almost reaching an existing commercial system and beating it in human evaluation.

