LG CL ML

Feb 12, 2015

A Latent Variable Model Approach to PMI-based Word Embeddings

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

arXiv:1502.03520v8

135 citations

Originality: Incremental advance

AI Analysis

This work addresses a foundational problem in natural language processing by offering theoretical insights into widely used word embedding techniques, though it is incremental in building on existing topic models.

The paper tackles the lack of theoretical justification for nonlinear word embedding methods like PMI, word2vec, and GloVe by proposing a new generative latent variable model, which provides closed-form expressions for word statistics and explains linear algebraic structures in embeddings.

Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods. This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The methodological novelty is to use the prior to compute closed form expressions for word statistics. This provides a theoretical justification for nonlinear models like PMI, word2vec, and GloVe, as well as some hyperparameter choices. It also helps explain why low-dimensional semantic embeddings contain linear algebraic structure that allows solution of word analogies, as shown by Mikolov et al. (2013) and many subsequent papers. Experimental support is provided for the generative model assumptions, the most important of which is that latent word vectors are fairly uniformly dispersed in space.
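For readers unfamiliar with the PMI statistic mentioned in the abstract, the sketch below estimates pointwise mutual information, PMI(w, w') = log [ p(w, w') / (p(w) p(w')) ], from windowed co-occurrence counts. It is only an illustration of the standard definition, not the paper's model or experimental pipeline: the function name `pmi_table`, the `window` parameter, and the toy corpus are all hypothetical, and the probabilities are plain count ratios with no smoothing or reweighting.

```python
import math
from collections import Counter


def pmi_table(corpus, window=2):
    """Estimate PMI for unordered word pairs that co-occur within `window` tokens.

    `corpus` is a list of tokenized sentences (lists of strings).
    Hypothetical helper for illustration; no smoothing is applied.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        for i, word in enumerate(sentence):
            word_counts[word] += 1
            # Count each pair once, looking only at the tokens to the right.
            for j in range(i + 1, min(i + window + 1, len(sentence))):
                pair = tuple(sorted((word, sentence[j])))
                pair_counts[pair] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    table = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / total_pairs          # empirical joint probability
        p_a = word_counts[a] / total_words  # empirical marginal of a
        p_b = word_counts[b] / total_words  # empirical marginal of b
        table[(a, b)] = math.log(p_ab / (p_a * p_b))
    return table


# Toy corpus, assumed purely for demonstration.
toy_corpus = [["a", "b"], ["a", "b"], ["c", "d"]]
toy_pmi = pmi_table(toy_corpus, window=1)
```

Pairs that never co-occur are simply absent from the returned dictionary, since their PMI would be log 0. Practical pipelines usually handle this with smoothing or by truncating at zero, which this sketch leaves out.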
