CL, Jan 2, 2020

Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation

Yirong Pan, Xiao Li, Yating Yang, Rui Dong

arXiv:2001.01589v11.630 citations

Originality: Incremental advance

AI Analysis

This paper addresses data sparseness and language complexity in machine translation from low-resource, morphologically rich agglutinative languages. The advance is incremental, since it builds on existing preprocessing techniques.

The paper tackles the problem of rare and unknown words in neural machine translation (NMT) for agglutinative languages by proposing a source-side morphological word segmentation method, which the authors report achieves significant improvements on Turkish-English and Uyghur-Chinese translation tasks.

Neural machine translation (NMT) has achieved impressive performance on machine translation tasks in recent years. However, for efficiency, a limited-size vocabulary containing only the top-N highest-frequency words is employed for model training, which leads to many rare and unknown words. The problem is especially acute when translating from low-resource, morphologically rich agglutinative languages, which have complex morphology and large vocabularies. In this paper, we propose a source-side morphological word segmentation method for NMT that incorporates morphological knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time. It can also be used as a preprocessing tool to segment words in agglutinative languages for other natural language processing (NLP) tasks. Experimental results show that our morphologically motivated word segmentation method is better suited to the NMT model and achieves significant improvements on Turkish-English and Uyghur-Chinese machine translation tasks by reducing data sparseness and language complexity.
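The effect described in the abstract, where splitting words into a stem plus suffixes shrinks the vocabulary and turns unseen word forms into sequences of seen pieces, can be illustrated with a toy sketch. This is not the paper's segmenter: the suffix list, the "+" marker, the `segment` function, and the tiny Turkish word list are all assumptions chosen for illustration, and a real system would need a proper morphological analyzer that handles vowel harmony and ambiguity.

```python
# Toy illustration only: a greedy right-to-left suffix stripper.
# SUFFIXES is a hypothetical, hand-picked list, not taken from the paper.
SUFFIXES = ["ler", "lar", "de", "da"]


def segment(word, suffixes=SUFFIXES, min_stem=2):
    """Split a word into [stem, '+suffix', ...] by stripping known suffixes."""
    ordered = sorted(suffixes, key=len, reverse=True)  # try longest first
    tail = []
    stripped = True
    while stripped:
        stripped = False
        for suf in ordered:
            # Keep at least `min_stem` characters so the stem is never empty.
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                tail.insert(0, "+" + suf)
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + tail


# Hypothetical training words: forms of "ev" (house) and "okul" (school).
train = "ev evler evde evlerde okul okullar okulda".split()
unseen = "okullarda"  # does not occur in `train` as a whole word

word_vocab = set(train)
seg_vocab = {piece for w in train for piece in segment(w)}

print(segment("evlerde"))                 # ['ev', '+ler', '+de']
print(len(word_vocab), len(seg_vocab))    # 7 6
print(unseen in word_vocab)               # False: unknown at word level
print(all(p in seg_vocab for p in segment(unseen)))  # True: all pieces seen
```

Even in this small example the segmented vocabulary is smaller than the word-level one, and the unseen form "okullarda" decomposes into pieces that all appeared in training. The gap would be expected to widen on real agglutinative text, where a single stem can take many suffix combinations.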
