CL LG May 3, 2020

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

Xuanli He, Gholamreza Haffari, Mohammad Norouzi

arXiv:2005.06606v231.41018 citations Has Code

Originality: Incremental advance

AI Analysis

The paper addresses segmentation inefficiencies in machine translation with a new method for segmenting output sentences that yields measurable gains. The advance is incremental because it builds on existing subword techniques.

This paper tackles subword segmentation in neural machine translation by introducing Dynamic Programming Encoding (DPE), which treats the segmentation of output sentences as a latent variable to be marginalized out and uses a mixed character-subword transformer for exact inference. On WMT datasets, DPE improves on BPE by 0.9 BLEU on average and on BPE dropout by 0.55 BLEU on average.
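The marginalization over segmentations can be written as a forward recursion. The following is a sketch under assumed notation that does not appear in the summary itself: $x$ is the source sentence, $y = y_1 \dots y_T$ is the target as a character sequence, $V$ is the subword vocabulary, $\mathcal{Z}(y)$ is the set of segmentations of $y$ into subwords from $V$, and each subword is assumed to be scored given the characters before it rather than given how those characters were segmented.

```latex
\begin{align}
\log p(y \mid x)
  &= \log \sum_{z \in \mathcal{Z}(y)} \prod_{i=1}^{|z|}
     p\bigl(z_i \mid z_1 \dots z_{i-1},\, x\bigr), \\
\alpha_0 &= 0, \\
\alpha_k &= \log \sum_{\substack{0 \le j < k \\ y_{j+1:k} \in V}}
     \exp\Bigl(\alpha_j + \log p\bigl(y_{j+1:k} \mid y_{1:j},\, x\bigr)\Bigr),
     \qquad k = 1, \dots, T, \\
\log p(y \mid x) &= \alpha_T .
\end{align}
```

Under that assumption the sum over exponentially many segmentations collapses into $O(T \cdot m)$ subword evaluations, where $m$ is the longest subword length. Replacing the log-sum-exp with a max, and recording the maximizing $j$ at each position, gives the MAP segmentation.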

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English <=> (German, Romanian, Estonian, Finnish, Hungarian).
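The dynamic program described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `segment`, `toy_log_prob`, and `TOY_VOCAB` are hypothetical, and the toy scorer is a fixed unigram table that ignores both the character prefix and the source sentence, whereas DPE scores each subword with a mixed character-subword transformer. The sketch only shows how one pass over the string yields both the exact log marginal likelihood and the maximum-probability segmentation.

```python
import math

NEG_INF = float("-inf")

# Hypothetical toy vocabulary with made-up probabilities (an assumption for
# illustration; DPE would obtain these scores from a transformer instead).
TOY_VOCAB = {"a": 0.4, "b": 0.2, "c": 0.1, "ab": 0.2, "bc": 0.1}


def toy_log_prob(prefix, piece):
    """Log-probability of `piece` following `prefix`; None if out of vocabulary.

    The toy scorer ignores `prefix`; a real model would condition on it.
    """
    p = TOY_VOCAB.get(piece)
    return math.log(p) if p is not None else None


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def segment(text, log_prob, max_len=None):
    """Return (log marginal likelihood, MAP segmentation) of `text`.

    alpha[k] sums over all segmentations of text[:k];
    best[k] keeps only the highest-scoring one, with back[k] as its last cut.
    """
    n = len(text)
    alpha = [NEG_INF] * (n + 1)
    best = [NEG_INF] * (n + 1)
    back = [0] * (n + 1)
    alpha[0] = best[0] = 0.0

    for k in range(1, n + 1):
        start = 0 if max_len is None else max(0, k - max_len)
        terms = []
        for j in range(start, k):
            if alpha[j] == NEG_INF:
                continue  # text[:j] cannot be segmented at all
            lp = log_prob(text[:j], text[j:k])
            if lp is None:
                continue  # text[j:k] is not a vocabulary item
            terms.append(alpha[j] + lp)
            if best[j] + lp > best[k]:
                best[k] = best[j] + lp
                back[k] = j
        if terms:
            alpha[k] = logsumexp(terms)

    if alpha[n] == NEG_INF:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Follow back-pointers from the end to recover the MAP segmentation.
    pieces = []
    k = n
    while k > 0:
        pieces.append(text[back[k]:k])
        k = back[k]
    pieces.reverse()
    return alpha[n], pieces


log_marginal, map_pieces = segment("abc", toy_log_prob)
print(round(math.exp(log_marginal), 3), map_pieces)
```

For the toy table, "abc" has three valid segmentations (a|b|c, ab|c, a|bc) with probabilities 0.008, 0.02, and 0.04, so the marginal is 0.068 and the MAP segmentation is a|bc. Sharing one loop for the sum and the max keeps the two quantities consistent, since both use the same scores.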
