CL · Nov 7, 2019

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

Wei Zhang, Feifei Lin, Xiaodong Wang, Zhenshuang Liang, Zhen Huang

arXiv:1911.02737v1 · 0.34 citations

Originality: Incremental advance

AI Analysis

The paper addresses the need for finer-grained translation of Chinese and offers a simpler method for Chinese-English NMT that requires no word segmentation during preprocessing. It is incremental, however, in that it adapts existing techniques to a specific language pair.

The paper tackles Chinese-English neural machine translation by using Wubi encoding to enable sub-character-level processing. It achieves BLEU scores comparable to those of subword models with a much smaller vocabulary, which benefits model compression.

Neural machine translation (NMT) is one of the best methods for understanding the differences in semantic rules between two languages. Subword-level models have achieved impressive results, especially for Indo-European languages. However, when the translation task involves Chinese, semantic granularity remains at the word and character level, so a more fine-grained translation model for Chinese is still needed. In this paper, we introduce a simple and effective method for Chinese translation at the sub-character level. Our approach uses the Wubi input method to encode Chinese characters as sequences of English letters; byte-pair encoding (BPE) is then applied. Our method for Chinese-English translation eliminates the need for a complicated word segmentation algorithm during preprocessing. Furthermore, our method allows for sub-character-level neural translation based on a recurrent neural network (RNN) architecture, without preprocessing. The empirical results show that, for Chinese-English translation tasks, our sub-character-level model achieves a BLEU score comparable to that of the subword model despite having a much smaller vocabulary. Additionally, the small vocabulary is highly advantageous for NMT model compression.
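
The two-step encoding described in the abstract (Wubi codes first, then BPE) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The lookup table `WUBI_TABLE` is a tiny hand-written stand-in whose codes are illustrative and unverified; a real system would load a complete Wubi code table. The helper names (`wubi_encode`, `learn_bpe`, `merge_pair`, `apply_bpe`) are hypothetical, and learning BPE merges over the letter codes of individual characters is an assumption about how the two steps connect.

```python
from collections import Counter

# Hypothetical mini table: character -> letter code. The codes are
# illustrative only; replace them with a full, verified Wubi table.
WUBI_TABLE = {"你": "wqiy", "好": "vbg", "中": "khk", "国": "lgyi"}


def wubi_encode(sentence, table=WUBI_TABLE):
    """Map each character to its letter code; unknown characters pass through."""
    return [table.get(ch, ch) for ch in sentence if not ch.isspace()]


def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of letter strings."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def apply_bpe(word, merges):
    """Segment one letter string by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    codes = wubi_encode("你好 中国")
    merges = learn_bpe(codes, num_merges=3)
    print([apply_bpe(code, merges) for code in codes])
```

Because every Chinese character becomes a short string over the 26 Latin letters, the symbol inventory seen by BPE starts very small, which is consistent with the abstract's point about a much smaller vocabulary.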
