CL, LG · Nov 26, 2020

Decoding and Diversity in Machine Translation

arXiv:2011.13477v1 · 24 citations
AI Analysis

This research highlights a fundamental limitation for NMT developers: current systems cannot approach human-level BLEU while also maintaining human-level translation diversity. It also identifies search as a source of bias in translated gender pronouns.

This paper investigates the trade-off between BLEU score and translation diversity in Neural Machine Translation (NMT) systems. It finds that while search strategies improve BLEU, they lead to deterministic outputs lacking human-level diversity and bias the distribution of translated gender pronouns.

Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers employ a variety of heuristic techniques, including searching for the conditional mode (vs. sampling) and incorporating various training heuristics (e.g., label smoothing). While search strategies significantly improve BLEU score, they yield deterministic outputs that lack the diversity of human translations. Moreover, search tends to bias the distribution of translated gender pronouns. This makes human-level BLEU a misleading benchmark in that modern MT systems cannot approach human-level BLEU while simultaneously maintaining human-level translation diversity. In this paper, we characterize distributional differences between generated and real translations, examining the cost in diversity paid for the BLEU scores enjoyed by NMT. Moreover, our study implicates search as a salient source of known bias when translating gender pronouns.
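
The contrast between searching for the conditional mode and sampling can be illustrated with a minimal sketch. Everything here is a hypothetical toy, not the paper's experimental setup: `TOY_MODEL` is an invented two-outcome distribution over translations of a single gender-ambiguous source sentence, and `decode_mode`, `decode_sample`, and `pronoun_rates` are illustrative helper names. The point is only that mode-seeking decoding returns the same output every time, so a 60/40 preference in the model becomes 100/0 in the output, while sampling roughly preserves the model's proportions.

```python
import random
from collections import Counter

# Hypothetical model distribution over translations of one
# gender-ambiguous source sentence (assumed numbers, for illustration only).
TOY_MODEL = {
    "he is a doctor": 0.6,
    "she is a doctor": 0.4,
}

def decode_mode(dist):
    """Search: return the single most probable translation (the conditional mode)."""
    return max(dist, key=dist.get)

def decode_sample(dist, rng):
    """Sampling: draw one translation in proportion to its model probability."""
    translations = list(dist)
    weights = [dist[t] for t in translations]
    return rng.choices(translations, weights=weights, k=1)[0]

def pronoun_rates(outputs):
    """Fraction of outputs that start with each pronoun."""
    counts = Counter(output.split()[0] for output in outputs)
    total = sum(counts.values())
    return {pronoun: count / total for pronoun, count in counts.items()}

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

search_rates = pronoun_rates([decode_mode(TOY_MODEL) for _ in range(n)])
sample_rates = pronoun_rates([decode_sample(TOY_MODEL, rng) for _ in range(n)])

print("search:  ", search_rates)  # deterministic: {'he': 1.0}
print("sampling:", sample_rates)  # close to the model's 0.6 / 0.4 split
```

In this toy, search amplifies a modest preference into a categorical one, which is the kind of distributional shift the abstract attributes to search. A real NMT system decodes token by token (for example with beam search), which this sketch does not model.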

