Faster Transformer Decoding: N-gram Masked Self-Attention
This work addresses decoding efficiency in machine translation, which is relevant to practitioners, but the contribution is incremental because it builds on existing Transformer methods.
The paper tackles slow Transformer decoding by proposing N-gram masked self-attention, which truncates the target-side self-attention window under an N-gram assumption. On the WMT EnDe and EnFr datasets, the resulting model loses very little BLEU for N values between 4 and 8, depending on the task.
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
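The truncated target-side window can be pictured as a banded causal attention mask. The sketch below is illustrative only and is not taken from the paper: the function name `ngram_causal_mask` is hypothetical, and it assumes the window covers the current position plus the $N-1$ positions before it, which may differ from the authors' exact convention.

```python
def ngram_causal_mask(length, n):
    """Build a length x length boolean mask for target-side self-attention.

    mask[i][j] is True when target position i may attend to target position j.
    Assumption (not confirmed by the source): the window holds the current
    position and the n - 1 positions immediately before it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Position i sees j only if j is not in the future (j <= i)
    # and lies within the last n positions (j > i - n).
    return [[i - n < j <= i for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    # Print a small mask: 1 marks an allowed attention link, 0 a masked one.
    for row in ngram_causal_mask(6, 3):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Under this assumption, each decoding step attends to at most $N$ target positions regardless of how long the output has grown, and choosing `n` at least as large as the sequence length recovers the ordinary causal mask.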