CL, Nov 12, 2020

Inference-only sub-character decomposition improves translation of unseen logographic characters

arXiv:2011.06523v1993 citations
AI Analysis

This paper addresses a specific issue in machine translation from logographic languages such as Chinese and Japanese. It offers a practical solution for handling unseen characters, though the contribution is incremental.

The paper tackles the problem of translating unseen logographic characters in neural machine translation by proposing an inference-only sub-character decomposition method, which improves translation adequacy without requiring retraining or additional models.

Neural Machine Translation (NMT) on logographic source languages struggles when translating 'unseen' characters, which never appear in the training data. One possible approach to this problem uses sub-character decomposition for training and test sentences. However, this approach involves complete retraining, and its effectiveness for unseen character translation to non-logographic languages has not been fully explored. We investigate existing ideograph-based sub-character decomposition approaches for Chinese-to-English and Japanese-to-English NMT, for both high-resource and low-resource domains. For each language pair and domain we construct a test set where all source sentences contain at least one unseen logographic character. We find that complete sub-character decomposition often harms unseen character translation, and gives inconsistent results generally. We offer a simple alternative based on decomposition before inference for unseen characters only. Our approach allows flexible application, achieving translation adequacy improvements and requiring no additional models or training.
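
The idea of decomposing only unseen characters before inference can be illustrated with a minimal sketch. This is not the authors' implementation: the two-entry decomposition table, the function names `build_char_vocab` and `decompose_unseen`, and the choice to keep the structural marker (such as ⿰) in the output are all assumptions made for illustration. A real system would load a full ideographic decomposition resource and pass the rewritten sentence to an already-trained NMT model.

```python
# Toy decomposition table (an assumption for illustration only).
# Each entry maps a character to an ideographic description sequence:
# a structural marker followed by the character's components.
TOY_DECOMPOSITIONS = {
    "好": "⿰女子",
    "明": "⿰日月",
}


def build_char_vocab(training_sentences):
    """Collect every character that appears on the training source side."""
    return {ch for sentence in training_sentences for ch in sentence}


def decompose_unseen(sentence, seen_chars, table=TOY_DECOMPOSITIONS):
    """Decompose only characters that never appeared in training.

    Seen characters are left untouched, and unseen characters without a
    table entry are passed through unchanged.
    """
    out = []
    for ch in sentence:
        if ch in seen_chars or ch not in table:
            out.append(ch)
        else:
            out.append(table[ch])
    return "".join(out)


# Hypothetical training data: neither 明 nor 好 occurs in it.
train = ["女子は日本にいる", "月が出た"]
seen = build_char_vocab(train)

# Only the unseen characters are rewritten; the rest of the input is intact.
preprocessed = decompose_unseen("明日は好い", seen)
print(preprocessed)
```

Because the rewrite happens purely at inference time, the trained model and its vocabulary stay unchanged, which is what lets this kind of preprocessing be switched on or off per sentence.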

