CL · Apr 23, 2017

Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

arXiv:1704.06986v1 · 34 citations
Originality: Incremental advance
AI Analysis

This paper addresses the inability of existing models to capture bursty word distributions in open-vocabulary tasks, though the advance is incremental because it builds on prior hierarchical and character-level approaches.

The paper addresses the failure of fixed-vocabulary language models to handle the creation and reuse of new words in natural language. It augments a hierarchical LSTM model with a caching mechanism that reuses previously generated words, and demonstrates the approach on a new corpus covering 7 typologically diverse languages.

Fixed-vocabulary language models fail to account for one of the most characteristic statistical facts of natural language: the frequent creation and reuse of new word types. Although character-level language models offer a partial solution in that they can create word types not attested in the training corpus, they do not capture the "bursty" distribution of such words. In this paper, we augment a hierarchical LSTM language model that generates sequences of word tokens character by character with a caching mechanism that learns to reuse previously generated words. To validate our model we construct a new open-vocabulary language modeling corpus (the Multilingual Wikipedia Corpus, MWC) from comparable Wikipedia articles in 7 typologically diverse languages and demonstrate the effectiveness of our model across this range of languages.
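
The idea of mixing "create a word character by character" with "reuse a word from a cache" can be illustrated with a toy sketch. This is not the paper's implementation: the uniform character model stands in for the hierarchical LSTM, the fixed `reuse_weight` stands in for whatever learned mechanism decides between creating and reusing, and every function and parameter name here is hypothetical.

```python
from collections import Counter

# Assumption: a tiny lowercase alphabet; the real model works over each language's characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def char_model_prob(word: str) -> float:
    """Toy stand-in for a character-level generator.

    Each character, plus one end-of-word symbol, is drawn uniformly,
    so any spelling gets a small but non-zero probability.
    """
    per_symbol = 1.0 / (len(ALPHABET) + 1)
    return per_symbol ** (len(word) + 1)


def cache_prob(word: str, cache: Counter) -> float:
    """Probability of reusing `word`, proportional to how often it was seen so far."""
    total = sum(cache.values())
    return cache[word] / total if total else 0.0


def mixture_prob(word: str, cache: Counter, reuse_weight: float = 0.3) -> float:
    """Mix the reuse path and the create path.

    Assumption: `reuse_weight` is a fixed constant here; with an empty
    cache all probability mass goes to the character model.
    """
    if not cache:
        reuse_weight = 0.0
    return (reuse_weight * cache_prob(word, cache)
            + (1.0 - reuse_weight) * char_model_prob(word))


def score_sequence(words: list[str], reuse_weight: float = 0.3) -> list[float]:
    """Score each word given the words before it, then add it to the cache."""
    cache: Counter = Counter()
    probs = []
    for word in words:
        probs.append(mixture_prob(word, cache, reuse_weight))
        cache[word] += 1
    return probs


if __name__ == "__main__":
    # "zorb" is a made-up word: expensive to spell out once, cheap to reuse.
    for word, p in zip(["zorb", "the", "zorb"], score_sequence(["zorb", "the", "zorb"])):
        print(f"{word}: {p:.3e}")
```

In this sketch the second occurrence of a novel word is far more probable than the first, because the cache contributes most of its probability. That is the "bursty" behaviour a purely character-level model does not capture.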
