CL Oct 25, 2020

Autoencoding Improves Pre-trained Word Embeddings

arXiv:2010.13094v231.1997 citations

Originality: Synthesis-oriented

AI Analysis

This work offers an incremental improvement for NLP practitioners by refining existing post-processing methods for pre-trained word embeddings.

The paper tackles the problem of improving pre-trained word embeddings by analyzing their geometry. It shows that retaining the top principal components, which is what a linear autoencoder does, increases accuracy without extra data, contradicting prior work that proposed removing them.

Prior work investigating the geometry of pre-trained word embeddings has shown that word embeddings are distributed in a narrow cone, and that by centering and projecting using principal component vectors one can increase the accuracy of a given set of pre-trained word embeddings. However, theoretically, this post-processing step is equivalent to applying a linear autoencoder to minimise the squared ℓ2 reconstruction error. This result contradicts prior work (Mu and Viswanath, 2018) that proposed to remove the top principal components from pre-trained embeddings. We experimentally verify our theoretical claims and show that retaining the top principal components is indeed useful for improving pre-trained word embeddings, without requiring access to additional linguistic resources or labelled data.
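The contrast drawn in the abstract can be illustrated with a small NumPy sketch. It is not the paper's procedure: the function names, the synthetic data, and the choice of k = 5 are all assumptions made for illustration. It relies only on the standard result that the best rank-k linear reconstruction under squared ℓ2 error is the projection onto the top-k principal components, and it compares that projection with the opposite operation of removing those components.

```python
import numpy as np


def center(X):
    """Subtract the mean embedding from every row."""
    return X - X.mean(axis=0, keepdims=True)


def top_components(X_centered, k):
    """Return the top-k principal directions as rows of a (k, d) matrix."""
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[:k]


def retain_top(X, k):
    """Center, then project onto the top-k principal components.

    This equals the output of an optimal linear autoencoder with k hidden
    units trained to minimise squared l2 reconstruction error.
    """
    Xc = center(X)
    U = top_components(Xc, k)
    return Xc @ U.T @ U


def remove_top(X, k):
    """Center, then subtract the projection onto the top-k components."""
    Xc = center(X)
    U = top_components(Xc, k)
    return Xc - Xc @ U.T @ U


# Synthetic stand-in for an embedding matrix: 200 "words", 20 dimensions.
# Real pre-trained embeddings would be loaded from a file instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20)) + 3.0

kept = retain_top(X, k=5)
removed = remove_top(X, k=5)
```

The two outputs are complementary: `kept + removed` recovers the centered matrix, so the two post-processing choices split the same embedding into the part a linear autoencoder reconstructs and the part it discards.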
