An Analysis of Neural Language Modeling at Multiple Scales
This work gives NLP researchers and practitioners efficient, high-performing language models, though its contribution is incremental because it builds on established architectures rather than introducing a new one.
The paper tackles language modeling by extending existing LSTM and QRNN models to larger vocabularies and to character-level granularity. It reports state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, with training times ranging from 12 hours (WikiText-103) to 2 days (enwik8) on a single GPU.
Many of the leading approaches in language modeling introduce novel, complex and specialized architectures. We take existing state-of-the-art word-level language models based on LSTMs and QRNNs and extend them to both larger vocabularies and character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, respectively. Results are obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single modern GPU.
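
To make the distinction between word-level and character-level granularity concrete, here is a minimal sketch of how the same text might be split into tokens at each scale. The function names, the whitespace tokenizer, and the sample sentence are illustrative assumptions, not the paper's preprocessing pipeline.

```python
from collections import Counter


def word_tokens(text):
    # Whitespace splitting stands in for a real word-level tokenizer (assumption).
    return text.split()


def char_tokens(text):
    # Every character, including spaces, becomes its own token.
    return list(text)


def build_vocab(tokens):
    # Map each distinct token to an integer id, most frequent first.
    counts = Counter(tokens)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common())}


sample = "the cat sat on the mat"  # hypothetical example text

words = word_tokens(sample)
chars = char_tokens(sample)
word_vocab = build_vocab(words)
char_vocab = build_vocab(chars)

# Sequence length and vocabulary size at each granularity.
print("word-level:", len(words), "tokens,", len(word_vocab), "vocabulary entries")
print("char-level:", len(chars), "tokens,", len(char_vocab), "vocabulary entries")
```

On this toy sentence the character-level sequence is several times longer than the word-level one. On a real corpus the word vocabulary grows to a very large number of entries while the character vocabulary stays small, so the two settings stress a model in different ways: word-level modeling faces a large output vocabulary, and character-level modeling faces long sequences.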