CL, Mar 1, 2019

Chinese-Japanese Unsupervised Neural Machine Translation Using Sub-character Level Information

arXiv:1903.00149v1, 10 citations
Originality: Incremental advance
AI Analysis

This work addresses unsupervised translation between Chinese and Japanese, a logographic language pair, by training on sub-character level data. The advance is incremental because it builds on existing UNMT methods.

The paper tackles unsupervised neural machine translation for Chinese-Japanese, a logographic language pair not previously studied in this setting, by using sub-character level data (ideograph or stroke). Sub-character level data improved BLEU scores over character-level data, and stroke-level systems outperformed ideograph-level ones.

Unsupervised neural machine translation (UNMT) requires only monolingual data of similar language pairs during training and can produce bi-directional translation models with relatively good performance on alphabetic languages (Lample et al., 2018). However, no research has been done on logographic language pairs. This study focuses on Chinese-Japanese UNMT trained on data containing sub-character (ideograph or stroke) level information, which is decomposed from character level data. BLEU scores of character and sub-character level systems were compared against each other. The results showed that, although UNMT is already effective on character level data, sub-character level data further enhanced performance, with the stroke level system outperforming the ideograph level system.
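
The decomposition step mentioned in the abstract can be pictured with a minimal sketch. The table `TOY_IDEOGRAPHS`, the function names, and the choice to join units with spaces are all assumptions made for illustration; they are not the resource or preprocessing used in the paper. A stroke level variant would follow the same pattern with a table that maps characters to stroke sequences.

```python
# Toy decomposition table: hypothetical, hand-written for illustration only.
# It is NOT the decomposition resource used in the paper.
TOY_IDEOGRAPHS = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
}


def decompose(text, table):
    """Replace each character by its sub-character units.

    Characters missing from the table are kept unchanged, so mixed
    text (kana, digits, Latin letters) passes through untouched.
    """
    units = []
    for ch in text:
        units.extend(table.get(ch, [ch]))
    return units


def to_training_line(text, table):
    """Join the units with spaces (an assumed format for monolingual training text)."""
    return " ".join(decompose(text, table))


if __name__ == "__main__":
    print(to_training_line("明林", TOY_IDEOGRAPHS))
```

Because Chinese and Japanese share many components below the character level, a decomposition like this can expose overlap between the two monolingual corpora that character level tokens may hide, which is one plausible reading of why the sub-character systems score higher.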
