CL · Oct 23, 2023

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

arXiv:2310.15262v1134 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

This work addresses data scarcity in machine translation for code-switched texts, providing insights for researchers and practitioners, but it is incremental as it compares existing methods without introducing new ones.

The study compared three data augmentation techniques for machine translation of Egyptian Arabic-English code-switched texts, finding that back-translation and predictive lexical replacement performed best, while linguistic theories and random replacement were effective when parallel data was scarce.

Code-switching (CSW) text generation has been receiving increasing attention as a way to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove to be effective in the absence of CSW parallel data, where both approaches achieve similar results.
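To make the simplest of these approaches concrete, the following is a minimal sketch of random lexical replacement: some words in a monolingual sentence are swapped for their translations to produce synthetic code-switched text. It is an illustration under stated assumptions, not the paper's implementation. The function name `random_lexical_replacement`, the `ratio` parameter, and the toy word-level `lexicon` are all hypothetical; in practice the replacement candidates would more likely come from word alignments over parallel data.

```python
import random


def random_lexical_replacement(tokens, lexicon, ratio=0.3, seed=None):
    """Replace a random subset of tokens with their lexicon translations.

    tokens  -- list of source-language words (here: Arabic)
    lexicon -- hypothetical word-level dictionary, source word -> English word
    ratio   -- fraction of replaceable tokens to switch (assumed in [0, 1])
    seed    -- optional seed for reproducible sampling
    """
    rng = random.Random(seed)

    # Only tokens with a known translation are candidates for switching.
    candidates = [i for i, tok in enumerate(tokens) if tok in lexicon]
    if not candidates:
        return list(tokens)

    # Switch at least one word so the output is actually code-switched,
    # and never more words than there are candidates.
    k = min(len(candidates), max(1, round(ratio * len(candidates))))
    chosen = set(rng.sample(candidates, k))

    return [lexicon[tok] if i in chosen else tok for i, tok in enumerate(tokens)]


# Toy example: the lexicon below is made up for illustration only.
sentence = ["انا", "عندي", "اجتماع", "بكرة"]
toy_lexicon = {"اجتماع": "meeting", "بكرة": "tomorrow"}

augmented = random_lexical_replacement(sentence, toy_lexicon, ratio=0.5, seed=0)
print(" ".join(augmented))
```

The switch points are chosen uniformly at random, with no model of where speakers naturally switch. That is the main difference from the predictive and linguistic-theory approaches compared in the study, which constrain or learn the switch points.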
