CL | Apr 7, 2020

Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation

Dana Ruiter, Josef van Genabith, Cristina España-Bonet

arXiv:2004.03151v231.1999 citations

Originality: Incremental advance

AI Analysis

This work addresses training-data selection in self-supervised neural machine translation and represents an incremental advance on existing self-supervised methods.

The study analyzes how a self-supervised neural machine translation model, without being instructed to do so, selects training samples of increasing complexity and task relevance. On Wikipedia data, measured with the Gunning-Fog readability index, the extracted content progresses from a level suitable for high school students to one suitable for first-year undergraduates.

Self-supervised neural machine translation (SSNMT) jointly learns to identify and select suitable training data from comparable (rather than parallel) corpora and to translate, in a way that the two tasks support each other in a virtuous circle. In this study, we provide an in-depth analysis of the sampling choices the SSNMT model makes during training. We show how, without having been told to do so, the model self-selects samples of increasing (i) complexity and (ii) task-relevance in combination with (iii) performing a denoising curriculum. We observe that the dynamics of the mutual-supervision signals of both system-internal representation types are vital for the extraction and translation performance. We show that in terms of the Gunning-Fog Readability index, SSNMT starts extracting and learning from Wikipedia data suitable for high school students and quickly moves towards content suitable for first-year undergraduate students.
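The Gunning-Fog index mentioned in the abstract is commonly defined as 0.4 × (average sentence length in words + percentage of words with three or more syllables), with scores around 12 usually read as high school senior level and around 13 as first-year undergraduate level. The following is a minimal sketch of that formula, not the authors' implementation: the function names `count_syllables` and `gunning_fog` are hypothetical, the syllable counter is a crude vowel-group heuristic, and the usual exclusions (proper nouns, compound words, common suffixes) are not applied.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (heuristic assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text: str) -> float:
    """Return 0.4 * (words per sentence + percentage of complex words)."""
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    # "Complex" here means three or more syllables under the heuristic above.
    complex_words = [w for w in words if count_syllables(w) >= 3]

    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)


if __name__ == "__main__":
    easy = "The cat sat. The dog ran."
    hard = "The translation model selects complicated sentences."
    print(gunning_fog(easy), gunning_fog(hard))
```

Scoring each batch of extracted sentence pairs with a function like this over the course of training is one way to trace the kind of readability progression the abstract describes; a dictionary-based syllable counter would give more reliable values than the heuristic used here.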
