Strategies for Training Large Vocabulary Neural Language Models
This work addresses the scalability of neural language models for applications such as speech recognition and machine translation. Its contribution is incremental, since it focuses on comparing and extending existing methods.
The paper tackles the computational cost of training neural language models with large vocabularies by systematically comparing strategies such as softmax variants and noise contrastive estimation. It evaluates them on three benchmarks, examining performance on rare words and the speed/accuracy trade-off.
Training neural network language models over large vocabularies is still computationally very costly compared to count-based models such as Kneser-Ney. At the same time, neural language models are gaining popularity for many applications such as speech recognition and machine translation whose success depends on scalability. We present a systematic comparison of strategies to represent and train large vocabularies, including softmax, hierarchical softmax, target sampling, noise contrastive estimation and self normalization. We further extend self normalization to be a proper estimator of likelihood and introduce an efficient variant of softmax. We evaluate each method on three popular benchmarks, examining performance on rare words, the speed/accuracy trade-off and complementarity to Kneser-Ney.
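To make the cost difference between some of these strategies concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It contrasts the full softmax loss with a two-level (class-based) hierarchical softmax loss, and shows one commonly used form of a self-normalization penalty. The function names, the penalty weight `alpha`, and the squared log-partition penalty are illustrative assumptions; the paper's own formulations may differ.

```python
import math

def log_sum_exp(scores):
    """Numerically stable log of the sum of exponentials of the scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def softmax_nll(scores, target):
    """Full softmax negative log-likelihood.

    The normalizer touches every score, so the cost grows linearly
    with the vocabulary size.
    """
    return log_sum_exp(scores) - scores[target]

def self_normalized_loss(scores, target, alpha=0.1):
    """Softmax loss plus a penalty that pushes log Z towards 0.

    Assumption: a squared log-partition penalty with weight alpha.
    If log Z stays near 0, the raw score of a word can be read as an
    approximate log-probability without computing the normalizer.
    """
    log_z = log_sum_exp(scores)
    return (log_z - scores[target]) + alpha * log_z ** 2

def two_level_nll(class_scores, word_scores_in_class, target_class, target_word):
    """Class-based hierarchical softmax: -log p(class) - log p(word | class).

    Only the class scores and the scores of words inside the target
    class are normalized, not the whole vocabulary.
    """
    return (softmax_nll(class_scores, target_class)
            + softmax_nll(word_scores_in_class, target_word))
```

With a vocabulary of V words split into roughly sqrt(V) classes of sqrt(V) words each, the two-level loss normalizes over about 2·sqrt(V) scores per example instead of V, which is the source of the speed-up usually attributed to hierarchical softmax.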