CL LG Nov 26, 2020

Decoding and Diversity in Machine Translation

Nicholas Roberts, Davis Liang, Graham Neubig, Zachary C. Lipton

arXiv:2011.13477v12.624 citations

Originality: Incremental advance

AI Analysis

This research highlights a fundamental limitation for NMT developers: current systems cannot approach human-level BLEU scores while also maintaining human-level translation diversity. It also identifies search as a source of gender bias.

This paper investigates the trade-off between BLEU score and translation diversity in Neural Machine Translation (NMT) systems. It finds that while search strategies improve BLEU, they lead to deterministic outputs lacking human-level diversity and bias the distribution of translated gender pronouns.

Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers employ a variety of heuristic techniques, including searching for the conditional mode (vs. sampling) and incorporating various training heuristics (e.g., label smoothing). While search strategies significantly improve BLEU score, they yield deterministic outputs that lack the diversity of human translations. Moreover, search tends to bias the distribution of translated gender pronouns. This makes human-level BLEU a misleading benchmark in that modern MT systems cannot approach human-level BLEU while simultaneously maintaining human-level translation diversity. In this paper, we characterize distributional differences between generated and real translations, examining the cost in diversity paid for the BLEU scores enjoyed by NMT. Moreover, our study implicates search as a salient source of known bias when translating gender pronouns.
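The contrast between searching for the conditional mode and sampling can be illustrated with a minimal sketch. The two-outcome distribution below is invented purely for illustration and is not taken from the paper; the names `TOY_DIST`, `decode_mode`, `decode_sample`, and `output_frequencies` are hypothetical. It shows how always picking the most probable output collapses a 60/40 split into a deterministic 100/0 split, while sampling roughly preserves the model's distribution.

```python
import random
from collections import Counter

# Hypothetical toy conditional distribution over the pronoun chosen when
# translating one ambiguous source sentence. The probabilities are invented
# for illustration and do not come from the paper.
TOY_DIST = {"he": 0.6, "she": 0.4}


def decode_mode(dist):
    """Search-style decoding: always return the single most probable output."""
    return max(dist, key=dist.get)


def decode_sample(dist, rng):
    """Sampling-style decoding: draw an output in proportion to its probability."""
    outputs = list(dist)
    weights = [dist[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=1)[0]


def output_frequencies(decoder, n):
    """Run a zero-argument decoder n times and return relative frequencies."""
    counts = Counter(decoder() for _ in range(n))
    return {k: counts[k] / n for k in TOY_DIST}


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
mode_freqs = output_frequencies(lambda: decode_mode(TOY_DIST), n)
sample_freqs = output_frequencies(lambda: decode_sample(TOY_DIST, rng), n)

print("mode-seeking:", mode_freqs)
print("sampling:    ", sample_freqs)
```

In this toy setting the mode-seeking decoder never emits the minority outcome, which is the kind of distributional distortion the abstract attributes to search. A real NMT decoder operates over whole token sequences, so this only conveys the intuition.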
