CL, Jan 10

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

arXiv:2601.03135, 1 citation, h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses the data scarcity problem for low-resource indigenous language translation, but the gains are incremental and limited to specific language pairs.

The authors augment parallel corpora for indigenous languages (Guarani-Spanish, Quechua-Spanish) with synthetic data from a multilingual model and apply language-specific preprocessing, achieving consistent chrF++ improvements over baseline mBART models.

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.
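The abstract mentions orthographic normalization and noise-aware filtering without giving details. The sketch below shows one plausible shape for such a step, using only the Python standard library. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`normalize`, `keep_pair`, `clean_corpus`), the apostrophe mapping, and the thresholds `max_ratio` and `min_chars` are all hypothetical choices.

```python
import unicodedata

# Assumed convention: map apostrophe-like characters to a plain ASCII
# apostrophe, so that one word is not split into several surface forms.
APOSTROPHES = str.maketrans({
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02BC": "'",  # modifier letter apostrophe
    "`": "'",
    "\u00B4": "'",  # acute accent used as an apostrophe
})


def normalize(text: str) -> str:
    """Apply NFC composition, unify apostrophes, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(APOSTROPHES)
    return " ".join(text.split())


def keep_pair(src: str, tgt: str, max_ratio: float = 3.0, min_chars: int = 3) -> bool:
    """Heuristic noise filter; the thresholds are illustrative assumptions."""
    if len(src) < min_chars or len(tgt) < min_chars:
        return False  # too short to be a useful training pair
    if src == tgt:
        return False  # untranslated copy
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio  # drop pairs with implausible length mismatch


def clean_corpus(pairs):
    """Normalize both sides, filter noisy pairs, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if keep_pair(src, tgt) and (src, tgt) not in seen:
            seen.add((src, tgt))
            cleaned.append((src, tgt))
    return cleaned
```

Normalizing before deduplication matters here: two pairs that differ only in apostrophe style or Unicode composition collapse into one entry. A filter this generic knows nothing about morphology, which is consistent with the abstract's remark that generic preprocessing has limits for highly agglutinative languages such as Aymara.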

