Balancing Training for Multilingual Neural Machine Translation
This work addresses data imbalance in multilingual machine translation, offering researchers and practitioners a more effective alternative to standard up-sampling heuristics.
The paper tackles the problem of imbalanced training data in multilingual machine translation by proposing a method that automatically learns to weight training data to maximize performance across all languages, achieving consistent improvements over heuristic baselines in both one-to-many and many-to-one settings.
When training multilingual machine translation (MT) models that can translate to/from multiple languages, we are faced with imbalanced training sets: some languages have much more training data than others. Standard practice is to up-sample less resourced languages to increase their representation, and the degree of up-sampling has a large effect on overall performance. In this paper, we propose a method that instead automatically learns how to weight training data through a data scorer that is optimized to maximize performance on all test languages. Experiments on two sets of languages under both one-to-many and many-to-one MT settings show that our method not only consistently outperforms heuristic baselines in terms of average performance, but also offers flexible control over which languages' performance is optimized.
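To make the up-sampling baseline concrete, the following is a minimal sketch of temperature-based sampling, one common heuristic for rebalancing languages. It is an assumption that this is the kind of heuristic meant here; the abstract does not name a specific scheme. The function name `sampling_probs`, the `temperature` parameter, and the corpus sizes are hypothetical and chosen only for illustration.

```python
def sampling_probs(sizes, temperature=1.0):
    """Return per-language sampling probabilities.

    Each language is sampled with probability proportional to
    (its share of the data) ** (1 / temperature).
    temperature = 1 reproduces proportional sampling; larger values
    flatten the distribution and so up-sample smaller languages.
    """
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Hypothetical corpus sizes (sentence pairs), not taken from the paper.
sizes = {"fr": 1_000_000, "az": 10_000}

proportional = sampling_probs(sizes, temperature=1.0)
upsampled = sampling_probs(sizes, temperature=5.0)
```

The temperature is a hand-tuned hyperparameter in this sketch, which is the sensitivity the abstract points to: the proposed method replaces such a fixed choice with weights learned by a data scorer.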