CL · Dec 20, 2022

On the Role of Parallel Data in Cross-lingual Transfer Learning

DeepMind

arXiv:2212.10173v121.8230 citations · h-index: 33

Originality: Incremental advance

AI Analysis

This work examines how to improve cross-lingual transfer learning, a question relevant to NLP researchers and practitioners. It is an incremental advance, since it builds on prior findings about parallel data.

The study investigated whether improvements in cross-lingual transfer learning from parallel data stem from the data itself or the modeling of parallel interactions, by comparing unsupervised machine translation-generated synthetic data with supervised and gold parallel data. It found that synthetic data can be useful in general and task-specific settings, though real data yields the best results, suggesting that multilingual models underutilize monolingual data and prompting a reevaluation of cross-lingual learning approaches.

While prior work has established that the use of parallel data is conducive to cross-lingual learning, it is unclear whether the improvements come from the data itself or from the modeling of parallel interactions. Exploring this, we examine the use of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model-generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) and a task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.
