CL
May 31, 2019

Examining Structure of Word Embeddings with PCA

arXiv:1906.00114v1
13 citations
Originality: Synthesis-oriented
AI Analysis

This work provides insights into embedding structures for Czech language processing, but it is incremental as it applies existing PCA methods to compare models without introducing new techniques.

The paper investigates the structure of Czech word embeddings across different models, finding that while part-of-speech (POS) information is present in word2vec embeddings, it is more organized in neural machine translation (NMT) decoder embeddings, suggesting its greater importance for translation tasks.

In this paper we compare the structure of Czech word embeddings for English-Czech neural machine translation (NMT), word2vec, and sentiment analysis. We show that although it is possible to successfully predict part-of-speech (POS) tags from the word embeddings of word2vec and various translation models, not all of the embedding spaces show the same structure. The information about POS is present in word2vec embeddings, but the high degree of organization by POS in the NMT decoder suggests that this information is more important for machine translation and that the NMT model therefore represents it in a more direct way. Our method is based on the correlation of principal component analysis (PCA) dimensions with categorical linguistic data. We also show that further examining histograms of classes along the principal components is important for understanding how information is represented in embeddings.
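The method described in the abstract (correlating PCA dimensions with categorical data, then inspecting per-class histograms along a component) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `pca_project`, `component_class_correlation`, and `class_histograms` are hypothetical, the "embeddings" are synthetic Gaussian vectors rather than real Czech embeddings, and the correlation used here is a plain Pearson correlation against a one-hot class indicator, which is one reasonable reading of "correlation with categorical data".

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # shape: (n_words, n_components)

def component_class_correlation(Z, labels):
    """Pearson correlation of each component with a one-hot indicator per class."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    corr = np.zeros((Z.shape[1], len(classes)))
    for j, c in enumerate(classes):
        indicator = (labels == c).astype(float)  # 1.0 where the word has class c
        for i in range(Z.shape[1]):
            corr[i, j] = np.corrcoef(Z[:, i], indicator)[0, 1]
    return classes, corr

def class_histograms(z, labels, bins=10):
    """Histogram of one component's values, split by class, on shared bin edges."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    edges = np.histogram_bin_edges(z, bins=bins)
    hists = {c: np.histogram(z[labels == c], bins=edges)[0] for c in classes}
    return edges, hists

# Synthetic stand-in for embeddings (assumption): two POS classes,
# with nouns shifted along one raw dimension so PCA can pick it up.
rng = np.random.default_rng(0)
n_per_class, dim = 200, 20
labels = ["NOUN"] * n_per_class + ["VERB"] * n_per_class
X = rng.normal(size=(2 * n_per_class, dim))
X[:n_per_class, 0] += 6.0

Z = pca_project(X, n_components=3)
classes, corr = component_class_correlation(Z, labels)
edges, hists = class_histograms(Z[:, 0], labels)

print(classes)
print(np.round(corr, 2))   # rows: components, columns: classes
```

A large absolute correlation in a row of `corr` indicates that the component separates that class from the rest. The histograms add what a single coefficient hides, for example whether a class forms one cluster or several along the component.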
