Tuned and GPU-accelerated parallel data mining from comparable corpora
This work addresses the data scarcity issue for machine translation, particularly for low-resource languages and narrow domains, though it is incremental as it builds on an existing method.
The researchers tackled the problem of limited parallel data for statistical translation by improving the Yalign mining methodology, achieving a 15% increase in alignment accuracy across multiple text domains using Wikipedia dumps.
The multilingual nature of the world makes translation a crucial requirement today. Human-built parallel dictionaries are widely available, but they are limited and do not provide enough coverage for good-quality translation because of out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which unfortunately depend on the quantity and quality of their training data. Such data has very limited availability, especially for some languages and very narrow text domains. In this research we present our improvements to the Yalign mining methodology: we reimplement the comparison algorithm, introduce tuning scripts, and improve performance using GPU computing acceleration. The experiments are conducted on various text domains, and bi-data is extracted from Wikipedia dumps.