CL · LG · ML · Apr 5, 2020

Finding the Optimal Vocabulary Size for Neural Machine Translation

arXiv:2004.02334v2 · 1007 citations
AI Analysis

This work addresses vocabulary size, a key factor in NMT translation accuracy, though it appears incremental because it builds on known limitations of classification and autoregression.

The paper investigates how vocabulary size affects neural machine translation performance, analyzing the impact of Zipfian language distributions on classifier balance and identifying optimal vocabulary sizes across multiple languages and data sizes.

We cast neural machine translation (NMT) as a classification task in an autoregressive setting and analyze the limitations of both classification and autoregression components. Classifiers are known to perform better with balanced class distributions during training. Since the Zipfian nature of languages causes imbalanced classes, we explore its effect on NMT. We analyze the effect of various vocabulary sizes on NMT performance on multiple languages with many data sizes, and reveal an explanation for why certain vocabulary sizes are better than others.
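The link between a Zipfian distribution and class imbalance can be illustrated with a minimal sketch. It assumes an idealized Zipf law with exponent 1 and scores imbalance as the total variation distance from a uniform distribution. Both choices, and the function names `zipf_probs` and `imbalance`, are illustrative assumptions and are not taken from the paper. A real subword vocabulary would be scored from corpus token counts rather than from an ideal curve.

```python
def zipf_probs(k, s=1.0):
    """Ideal Zipfian probabilities for k classes (exponent s is an assumption)."""
    weights = [1.0 / (rank ** s) for rank in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def imbalance(probs):
    """Total variation distance from uniform: 0 means perfectly balanced classes."""
    k = len(probs)
    return 0.5 * sum(abs(p - 1.0 / k) for p in probs)


# Compare a few hypothetical vocabulary sizes under the ideal Zipf assumption.
for k in (100, 1000, 10000):
    print(k, round(imbalance(zipf_probs(k)), 3))
```

Under these assumptions the score grows with the number of classes, which is one way to see why a larger vocabulary gives the classifier a more skewed target distribution.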
