Dynamically Composing Domain-Data Selection with Clean-Data Selection by "Co-Curricular Learning" for Neural Machine Translation
This work addresses data quality in neural machine translation. Its contribution is incremental: it builds on existing data-selection methods and adds a dynamic composition of them.
The paper aims to improve neural machine translation by dynamically combining domain-data selection and clean-data selection, which were previously handled separately or combined statically, and demonstrates the method's effectiveness with experimental results on two domains.
Noise and domain are important aspects of data quality for neural machine translation. Existing research focuses separately on domain-data selection, clean-data selection, or their static combination, leaving the dynamic interaction between them not explicitly examined. This paper introduces a "co-curricular learning" method to compose dynamic domain-data selection with dynamic clean-data selection, for transfer learning across both capabilities. We apply an EM-style optimization procedure to further refine the "co-curriculum". Experimental results and analysis with two domains demonstrate the effectiveness of the method and the properties of data scheduled by the co-curriculum.
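As an illustration only, one way to picture composing two dynamic selections is to cascade two per-batch filters whose keep ratio shrinks as training proceeds. The sketch below is not taken from the paper: the per-example fields `domain_score` and `clean_score`, the exponential decay schedule, the `half_life` and `floor` parameters, and all function names are assumptions made for the example.

```python
import math


def keep_fraction(step, half_life, floor=0.1):
    """Fraction of a batch to keep at `step` (assumed schedule).

    Halves every `half_life` steps and never drops below `floor`.
    """
    return max(floor, 0.5 ** (step / half_life))


def top_fraction(examples, score_key, fraction):
    """Keep the highest-scoring `fraction` of examples (at least one)."""
    k = max(1, math.ceil(fraction * len(examples)))
    return sorted(examples, key=lambda ex: ex[score_key], reverse=True)[:k]


def co_select(batch, step, half_life=1000):
    """Cascade two filters: first by domain relevance, then by cleanliness."""
    frac = keep_fraction(step, half_life)
    in_domain = top_fraction(batch, "domain_score", frac)
    return top_fraction(in_domain, "clean_score", frac)


# Hypothetical scored sentence pairs; higher scores mean more in-domain / cleaner.
batch = [
    {"id": "a", "domain_score": 0.9, "clean_score": 0.2},
    {"id": "b", "domain_score": 0.8, "clean_score": 0.7},
    {"id": "c", "domain_score": 0.1, "clean_score": 0.9},
    {"id": "d", "domain_score": 0.3, "clean_score": 0.4},
]

early = co_select(batch, step=0)     # keeps the whole batch
later = co_select(batch, step=1000)  # keeps only in-domain and clean pairs
```

Early in training the filters pass everything, so the model sees the full data; later, only examples that rank well on both criteria survive. A static combination would instead fix the selected subset once, before training starts.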