CL · Mar 26

Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

arXiv:2603.25489 · 37.01 citations · h-index: 4
AI Analysis

The paper addresses a domain-specific problem in low-resource machine translation and offers an incremental improvement in how the direction of data augmentation is chosen.

The paper tackles low-resource machine translation for Romansh by addressing LLM confusion across its 6 varieties. It reports a 23 BLEU improvement over Gemini 3 Pro in the lowest-resource variety and, according to a human evaluation, the first model that generates fluent translations in the individual varieties.

Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

