CL, Jul 20, 2015

How to Generate a Good Word Embedding?

arXiv:1507.05523v1, 363 citations
Originality: Synthesis-oriented
AI Analysis

This work provides practical guidelines for training word embeddings, addressing efficiency and effectiveness for NLP practitioners, but it is incremental as it systematizes and compares existing methods.

The paper analyzes the impact of model, corpus, and training parameters on word embedding quality, finding that corpus domain is more important than size, faster models suffice in most cases, and early stopping should use task-specific development sets.

We analyze three critical components of word embedding training: the model, the corpus, and the training parameters. We systematize existing neural-network-based word embedding algorithms and compare them using the same corpus. We evaluate each word embedding in three ways: analyzing its semantic properties, using it as a feature for supervised tasks, and using it to initialize neural networks. We also provide several simple guidelines for training word embeddings. First, we discover that corpus domain is more important than corpus size. We recommend choosing a corpus in a suitable domain for the desired task; after that, using a larger corpus yields better results. Second, we find that faster models provide sufficient performance in most cases, and more complex models can be used if the training corpus is sufficiently large. Third, the early stopping metric for iterating should rely on the development set of the desired task rather than on the validation loss of the embedding training itself.
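
The third guideline can be illustrated with a minimal sketch. It assumes embedding snapshots are saved after each training iteration and that some task-specific scoring function on a development set is available. The names `pick_snapshot_by_dev_score`, `dev_score`, and `patience`, as well as the toy scores, are hypothetical and do not come from the paper.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

E = TypeVar("E")  # any object that represents one embedding snapshot


def pick_snapshot_by_dev_score(
    snapshots: Iterable[Tuple[int, E]],
    dev_score: Callable[[E], float],
    patience: int = 2,
) -> Tuple[Optional[int], Optional[E], float]:
    """Keep the snapshot with the highest downstream dev-set score.

    Stops once `patience` consecutive snapshots fail to improve on the
    best score so far. The embedding training loss is never consulted.
    """
    best_iter, best_emb, best_score = None, None, float("-inf")
    stale = 0  # consecutive evaluations without improvement
    for iteration, emb in snapshots:
        score = dev_score(emb)  # higher is better (e.g. F1 or accuracy)
        if score > best_score:
            best_iter, best_emb, best_score = iteration, emb, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_iter, best_emb, best_score


# Toy, made-up dev scores keyed by iteration (illustration only).
toy_dev_scores = {1: 0.61, 2: 0.66, 3: 0.70, 4: 0.69, 5: 0.68, 6: 0.71}
# Strings stand in for real embedding matrices.
toy_snapshots = [(i, f"emb_iter_{i}") for i in sorted(toy_dev_scores)]

best_iter, best_emb, best_score = pick_snapshot_by_dev_score(
    toy_snapshots,
    dev_score=lambda emb: toy_dev_scores[int(emb.rsplit("_", 1)[1])],
    patience=2,
)
print(best_iter, best_emb, best_score)
```

In the toy run, the search stops after two non-improving snapshots and keeps iteration 3, so a later snapshot is never evaluated. A larger `patience` trades extra evaluation cost for a lower risk of stopping too early.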

Code Implementations: 2 repos
