Filtering and Mining Parallel Data in a Joint Multilingual Space
This work addresses data quality issues for machine translation practitioners and offers a generic method applicable to many language pairs, though the contribution is incremental because it builds on existing embedding techniques.
The paper tackles the problem of noisy parallel data in machine translation by learning a joint multilingual sentence embedding, which it uses to filter and mine parallel data; filtering out 25% of the training data improves a WMT'14 English to German baseline by 0.3 BLEU.
We learn a joint multilingual sentence embedding and use the distance between sentences in different languages to filter noisy parallel data and to mine for parallel data in large news collections. We are able to improve a competitive baseline on the WMT'14 English to German task by 0.3 BLEU by filtering out 25% of the training data. The same approach is used to mine additional bitexts for the WMT'14 system and to obtain competitive results on the BUCC shared task to identify parallel sentences in comparable corpora. The approach is generic: it can be applied to many language pairs and is independent of the architecture of the machine translation system.
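
The filtering step described above can be illustrated with a minimal sketch. It assumes that each sentence has already been mapped to a vector in the joint space by some encoder (not shown), and it assumes cosine distance as the distance measure; the abstract does not say which distance is used. The function names, the toy vectors, and the `drop_fraction` parameter are hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two vectors (an assumed choice of distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def filter_bitext(pairs, src_vecs, tgt_vecs, drop_fraction=0.25):
    """Drop the `drop_fraction` of sentence pairs whose embeddings are farthest apart.

    pairs    -- list of (source sentence, target sentence) tuples
    src_vecs -- joint-space embeddings of the source sentences (hypothetical input)
    tgt_vecs -- joint-space embeddings of the target sentences (hypothetical input)
    """
    distances = [cosine_distance(s, t) for s, t in zip(src_vecs, tgt_vecs)]
    n_keep = len(pairs) - int(len(pairs) * drop_fraction)
    # Rank pair indices from closest to farthest and keep the closest n_keep.
    ranked = sorted(range(len(pairs)), key=lambda i: distances[i])
    kept_indices = sorted(ranked[:n_keep])  # restore original corpus order
    return [pairs[i] for i in kept_indices]


# Toy data: the third pair is deliberately misaligned (orthogonal vectors).
pairs = [("a house", "ein Haus"), ("a dog", "ein Hund"),
         ("a tree", "Guten Morgen"), ("a cat", "eine Katze")]
src_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
tgt_vecs = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0], [1.0, 0.9]]

kept = filter_bitext(pairs, src_vecs, tgt_vecs, drop_fraction=0.25)
```

With four pairs and a 25% drop rate, the single most distant pair is removed and the remaining three keep their original order. On a real corpus the same ranking could instead be cut at a fixed distance threshold; which of the two the authors use is not stated in the abstract.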