CL | May 2, 2022

The Implicit Length Bias of Label Smoothing on Beam Search Decoding

DeepMind
arXiv:2205.00659v1 | 1 citation | h-index: 28
Originality: Incremental advance
AI Analysis

This paper addresses a subtle but impactful issue for neural machine translation practitioners: it reveals and corrects an implicit bias introduced by a widely used training technique.

The paper demonstrates that label smoothing in neural machine translation introduces a length bias during beam search decoding, causing shorter translations, and proposes a rectification method that improves translation quality by up to +2.8 BLEU on multiple tasks.

Label smoothing is ubiquitously applied in Neural Machine Translation (NMT) training. While label smoothing offers a desired regularization effect during model training, in this paper we demonstrate that it nevertheless introduces length biases in the beam search decoding procedure. Our analysis shows that label smoothing implicitly applies a length penalty term to the output sequence, causing a bias towards shorter translations. We also show that for a model fully optimized with label smoothing, translation length is implicitly upper bounded by a fixed constant independent of the input. We verify our theory by applying a simple rectification function at inference time to restore the unbiased distributions from the label-smoothed model predictions. This rectification method led to consistent quality improvements on WMT English-German, English-French, English-Czech and English-Chinese tasks, up to +0.3 BLEU at beam size 4 and +2.8 BLEU at beam size 200.
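
The length bound and the rectification idea can be illustrated with a small sketch. It is not the paper's implementation. It assumes the common uniform form of label smoothing, in which a fully optimized model predicts q = (1 - eps) * p + eps / V for an unbiased distribution p over a vocabulary of size V. The function names `smooth`, `rectify`, and `length_bound` are hypothetical, and the bound shown is one that follows from this assumed form; the paper's exact constant and rectification function may differ.

```python
import math


def smooth(p, eps):
    """Mix distribution p with the uniform distribution (assumed smoothing form)."""
    v = len(p)
    return [(1.0 - eps) * pi + eps / v for pi in p]


def rectify(q, eps):
    """Undo the assumed mixing: subtract the uniform mass, clip at zero, renormalize."""
    v = len(q)
    raw = [max(qi - eps / v, 0.0) for qi in q]
    total = sum(raw)
    if total == 0.0:
        # Degenerate case: q carried no mass beyond the uniform floor.
        return [1.0 / v] * v
    return [r / total for r in raw]


def length_bound(eps, vocab_size):
    """Illustrative length bound under the assumed smoothing form.

    Every token probability lies in [eps / V, 1 - eps + eps / V]. A sequence
    of n tokens therefore scores at most n * log(1 - eps + eps / V), while
    emitting end-of-sequence immediately scores at least log(eps / V).
    Beyond the returned length, stopping at once always scores higher.
    """
    floor = eps / vocab_size
    ceiling = 1.0 - eps + eps / vocab_size
    return math.log(floor) / math.log(ceiling)


if __name__ == "__main__":
    p = [0.7, 0.2, 0.1, 0.0]
    q = smooth(p, eps=0.1)
    print("smoothed: ", q)
    print("rectified:", rectify(q, eps=0.1))
    print("length bound for eps=0.1, V=32000:", length_bound(0.1, 32000))
```

Under these assumptions the bound depends only on eps and V, not on the source sentence, which matches the abstract's claim of a fixed constant independent of the input. A larger eps gives a smaller bound, so stronger smoothing implies a stronger pull towards short outputs.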

