CL · Oct 25, 2020

Autoencoding Improves Pre-trained Word Embeddings

arXiv:2010.13094v2997 citations
Originality: Synthesis-oriented
AI Analysis

This work offers an incremental improvement for NLP practitioners by refining existing methods for post-processing pre-trained word embeddings.

The paper tackles the problem of improving pre-trained word embeddings by analyzing their geometry. It shows that retaining the top principal components through autoencoding increases accuracy without extra data, contradicting prior work that suggested removing them.

Prior work investigating the geometry of pre-trained word embeddings has shown that word embeddings are distributed in a narrow cone, and that centering and projecting them using principal component vectors can increase the accuracy of a given set of pre-trained word embeddings. However, theoretically, this post-processing step is equivalent to applying a linear autoencoder to minimise the squared l2 reconstruction error. This result contradicts prior work (Mu and Viswanath, 2018) that proposed removing the top principal components from pre-trained embeddings. We experimentally verify our theoretical claims and show that retaining the top principal components is indeed useful for improving pre-trained word embeddings, without requiring access to additional linguistic resources or labelled data.
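The equivalence the abstract mentions can be illustrated with a small NumPy sketch. It is not the authors' procedure. It only demonstrates the standard fact that, for centered data, projecting onto the top-k principal directions gives the rank-k linear reconstruction with the smallest squared l2 error, which is the optimum a linear autoencoder would reach. The synthetic matrix `X`, the choice `k = 5`, and the helper names `center`, `top_components`, `retain_top`, and `remove_top` are all assumptions made for illustration.

```python
import numpy as np


def center(X):
    """Subtract the mean embedding from every row."""
    return X - X.mean(axis=0, keepdims=True)


def top_components(Xc, k):
    """Return the top-k principal directions of centered data as a (d, k) matrix."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T


def retain_top(X, k):
    """Center, then keep only the part lying in the top-k principal subspace."""
    Xc = center(X)
    U = top_components(Xc, k)
    return Xc @ U @ U.T


def remove_top(X, k):
    """Center, then subtract the top-k principal subspace (the opposite choice)."""
    Xc = center(X)
    U = top_components(Xc, k)
    return Xc - Xc @ U @ U.T


# Synthetic stand-in for an embedding matrix: 200 "words", 20 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20)) + 3.0
k = 5  # hypothetical number of retained components

Xc = center(X)
pca_err = np.sum((Xc - retain_top(X, k)) ** 2)

# The error equals the energy in the discarded singular values.
singular_values = np.linalg.svd(Xc, compute_uv=False)
tail_energy = np.sum(singular_values[k:] ** 2)

# Any other rank-k orthogonal projection reconstructs no better.
Q, _ = np.linalg.qr(rng.normal(size=(20, k)))
rand_err = np.sum((Xc - Xc @ Q @ Q.T) ** 2)

print(f"PCA reconstruction error:      {pca_err:.3f}")
print(f"Discarded singular energy:     {tail_energy:.3f}")
print(f"Random rank-{k} projection error: {rand_err:.3f}")
```

`retain_top` and `remove_top` split the centered matrix into two complementary parts, so the two post-processing choices contrasted in the abstract differ only in which part is kept.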

