CL · May 20, 2025

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

arXiv:2505.14423v2 · 9 citations · h-index: 8 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses the scarcity of parallel data for low-resource languages in machine translation. Its contribution is incremental: it applies existing LLM methods to a new domain.

The researchers used large language models to generate synthetic parallel data for seven low-resource target languages and extended it via pivoting to 147 additional language pairs. Automatic and human evaluation confirmed the data's overall high quality, and training on it substantially improved MT performance.

We investigate the potential of LLM-generated synthetic data for improving low-resource Machine Translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its overall high quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, (iii) studying the effect of varying training data size, and (iv) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.

