Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
The paper addresses machine translation for a specific under-resourced language pair with code-switching. The contribution is incremental, since it applies known synthetic-data methods to a new domain.
The paper tackles machine translation for the low-resource, code-switched Kazakh-Russian language pair without labeled data by generating synthetic data. The resulting model reaches 16.48 BLEU and outperforms a commercial system in human evaluation.
Machine translation for low-resource language pairs is a challenging task, and it becomes considerably harder when a speaker uses code-switching. We propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data. Additionally, we present the first code-switched Kazakh-Russian parallel corpus and evaluation results, which include a model that achieves 16.48 BLEU, almost matching an existing commercial system and outperforming it in human evaluation.
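The abstract does not say how the synthetic data is generated. As a rough illustration of one common family of approaches (dictionary-based lexical substitution, which may differ from the method actually used in the paper), the sketch below turns a monolingual Kazakh sentence into a synthetic code-switched one. The lexicon `TOY_LEXICON`, the function `make_code_switched`, and the parameter `switch_prob` are all hypothetical names introduced here for illustration, and the sketch ignores morphology entirely.

```python
import random

# Hypothetical toy Kazakh -> Russian lexicon (illustration only, not from the paper).
TOY_LEXICON = {
    "кітап": "книга",    # book
    "мектеп": "школа",   # school
    "жұмыс": "работа",   # work
}


def make_code_switched(sentence, lexicon, switch_prob=0.5, rng=None):
    """Return a synthetic code-switched version of a Kazakh sentence.

    Each whitespace-separated token that has an entry in `lexicon` is
    replaced by its Russian counterpart with probability `switch_prob`.
    Tokens without an entry are left unchanged. Inflection is not handled.
    """
    rng = rng or random.Random(0)  # fixed seed by default for reproducibility
    mixed = []
    for token in sentence.split():
        if token in lexicon and rng.random() < switch_prob:
            mixed.append(lexicon[token])
        else:
            mixed.append(token)
    return " ".join(mixed)


if __name__ == "__main__":
    source = "мен кітап оқимын"
    # With switch_prob=1.0 every token found in the lexicon is switched.
    print(make_code_switched(source, TOY_LEXICON, switch_prob=1.0))
```

A realistic pipeline would need to handle Kazakh and Russian inflection and would likely draw substitutions from a much larger bilingual resource; this sketch only shows the basic idea of creating code-switched text from monolingual text.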