CL, LG · Jun 27, 2012

Two Step CCA: A new spectral method for estimating vector models of words

Paramveer Dhillon, Jordan Rodu, Dean Foster, Lyle Ungar

arXiv:1206.6403v1 · 10 citations

Originality: Incremental advance

AI Analysis

This work provides an incremental improvement in spectral methods for natural language processing, benefiting researchers and practitioners in text analysis.

The paper tackles the problem of learning low-dimensional word representations from unlabeled text data by introducing a two-step CCA method that reduces sample complexity and improves performance on POS tagging and sentiment classification tasks.

Unlabeled data is often used to learn representations that can supplement baseline features in a supervised learner. For example, in text applications where the words lie in a very high-dimensional space (the size of the vocabulary), one can learn a low-rank "dictionary" by an eigen-decomposition of the word co-occurrence matrix (e.g., using PCA or CCA). In this paper, we present a new spectral method based on CCA to learn an eigenword dictionary. Our improved procedure computes two sets of CCAs: the first between the left and right contexts of the given word, and the second between the projections resulting from this CCA and the word itself. We prove theoretically that this two-step procedure has lower sample complexity than the simple single-step procedure. We also illustrate the empirical efficacy of our approach, and the richness of the representations learned by our Two Step CCA (TSCCA) procedure, on the tasks of POS tagging and sentiment classification.
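The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `cca` and `two_step_cca`, the ridge term `reg`, the mean-centering, the one-word context window, and the choice to take the word-side projection of the second CCA as the eigenword dictionary are all assumptions made for the example. CCA is computed here by whitening each view and taking an SVD of the whitened cross-covariance.

```python
import numpy as np

def cca(X, Y, k, reg=1e-3):
    """Top-k CCA projections for views X (n x dx) and Y (n x dy).

    Hypothetical helper: ridge-regularized whitening followed by an SVD.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    # Projection matrices for each view, plus the canonical correlations.
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]

def two_step_cca(L, R, W, k):
    """Assumed reading of the abstract's two steps.

    Step 1: CCA between left contexts L and right contexts R.
    Step 2: CCA between the combined step-1 projections and the words W.
    Returns a (vocab x k) matrix used here as the eigenword dictionary.
    """
    A, B, _ = cca(L, R, k)
    S = np.hstack([L @ A, R @ B])   # n x 2k projected context state
    _, phi_W, corr = cca(S, W, k)
    return phi_W, corr

# Toy usage: random token ids, one-hot views, context window of one word.
rng = np.random.default_rng(0)
vocab, k = 20, 5
tokens = rng.integers(0, vocab, size=2000)
one_hot = np.eye(vocab)
L = one_hot[tokens[:-2]]    # word to the left of each position
W = one_hot[tokens[1:-1]]   # the word itself
R = one_hot[tokens[2:]]     # word to the right of each position

eigenwords, corr = two_step_cca(L, R, W, k)
```

Each row of `eigenwords` is a k-dimensional vector for one vocabulary item. On real text the context views would be built from actual neighboring words, and typically from a wider window than the single neighbor used in this toy setup.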
