CL May 22, 2023

Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

arXiv:2305.12786v2 135 citations
Originality: Incremental advance
AI Analysis

This paper addresses translation quality for low-resource languages in multilingual systems, though it appears incremental because it builds on existing MNMT approaches.

The paper tackles data imbalance and representation degeneration in multilingual machine translation by proposing Bi-ACL, a framework using target-side monolingual data and bilingual dictionaries, which improves performance for both long-tail and high-resource languages.

Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora for all language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to the problem of encoded tokens tending to appear only in a small subspace of the full space available to the MNMT model. To solve these two issues, we propose Bi-ACL, a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective both in long-tail languages and in high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios.
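The representation degeneration problem described above, where encoded tokens occupy only a small subspace, can be made concrete with a simple diagnostic. The sketch below is not taken from the paper: it assumes that mean pairwise cosine similarity is an acceptable proxy for how tightly a set of vectors is clustered, and the function names and toy vectors are hypothetical. Values near 1 suggest the vectors sit in a narrow cone, while values near or below 0 suggest they are spread out.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors.

    Assumption (not from the paper): a high average is used here as a
    rough indicator of representation degeneration.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy "token representations" in two dimensions.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]    # cover the plane
narrow = [[1.0, 0.1], [1.0, 0.2], [1.0, -0.1], [1.0, -0.2]]    # narrow cone

print("spread:", mean_pairwise_cosine(spread))
print("narrow:", mean_pairwise_cosine(narrow))
```

In this toy setting the narrow set scores far higher than the spread set. In practice the vectors would come from a trained model's encoder outputs or embedding matrix, which this sketch does not load.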

Code Implementations: 1 repo