CL AI LG | May 30, 2025

VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation

Hieu Tran, Phuong-Anh Nguyen-Le, Huy Nghiem, Quang-Nhan Nguyen, Wei Ai, Marine Carpuat

arXiv:2505.24472v1 | 2.7 | h-index: 36

Originality: Incremental advance

AI Analysis

The paper addresses code-mixed translation for low-resource language pairs and advances ecological validity in neural MT evaluation. The contribution is incremental because it builds on existing data augmentation methods.

The paper tackles the failure of machine translation on code-mixed inputs for low-resource languages by curating VietMix, a naturally occurring Vietnamese-English code-mixed corpus with expert translations, and by developing a synthetic data augmentation pipeline. Models trained on the combined data reach quality estimation scores of up to 71.84 on COMETkiwi and 81.77 on XCOMET.

Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. Augmenting this resource, we developed a complementary synthetic data generation pipeline. This pipeline incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows that our naturalistic and complementary synthetic data boost models' performance, with translation quality estimation scores reaching up to 71.84 on COMETkiwi and 81.77 on XCOMET. LLM-based assessments corroborate these results: augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.
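
The two preference figures in the abstract (about 49% of all judgments, 54-56% excluding ties) differ only in the denominator. The following is a minimal sketch of that arithmetic. The helper name `win_rates` and the judgment counts are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
def win_rates(wins: int, losses: int, ties: int) -> tuple[float, float]:
    """Return (win rate over all judgments, win rate excluding ties)."""
    total = wins + losses + ties
    decided = wins + losses  # judgments where one system was preferred
    if total == 0 or decided == 0:
        raise ValueError("need at least one decided judgment")
    return wins / total, wins / decided


# Hypothetical counts (assumption): 100 pairwise judgments, 11 of them ties.
overall, decided_only = win_rates(wins=49, losses=40, ties=11)
print(f"{overall:.1%} overall, {decided_only:.1%} excluding ties")
# prints "49.0% overall, 55.1% excluding ties"
```

Under these assumed counts, a tie rate of roughly one in ten is enough to move a 49% overall preference to about 55% among decided judgments, which is the kind of gap the abstract reports.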
