CL · Feb 26, 2019

Improving a tf-idf weighted document vector embedding

arXiv:1902.09875v10.932 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses document representation for tasks such as review analysis. It is incremental, since it builds on existing methods such as weighted sums of word vectors and common component removal.

The paper tackles the problem of computing dense vector embeddings for documents from word vectors. It finds that inverse document frequency (idf) weighting and common component removal improve performance, with idf performing best in the authors' applications.

We examine a number of methods to compute a dense vector embedding for a document in a corpus, given a set of word vectors such as those from word2vec or GloVe. We describe two methods that can improve upon a simple weighted sum and that are optimal in the sense that they maximize a particular weighted cosine similarity measure. We consider several weighting functions, including inverse document frequency (idf), smooth inverse frequency (SIF), and the sub-sampling function used in word2vec. We find that idf works best for our applications. We also use the common component removal proposed by Arora et al. as a post-processing step and find it helpful in most cases. We compare these embedding variations to the doc2vec embedding on a new evaluation task using TripAdvisor reviews, and also on the CQADupStack benchmark from the literature.
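
The baseline that the abstract starts from, a weighted sum of word vectors followed by common component removal, can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`idf_weights`, `sif_weights`, `embed_documents`, `remove_common_component`) are hypothetical, idf is taken in its common form log(N / df(w)), SIF as a / (a + p(w)) with an assumed default a = 1e-3, and the common component as the first right singular vector of the document-embedding matrix. The two optimal methods mentioned in the abstract are not reproduced here, since the abstract does not specify them.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(docs):
    """idf(w) = log(N / df(w)) over a list of tokenized documents (assumed form)."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / count) for w, count in df.items()}


def sif_weights(docs, a=1e-3):
    """Smooth inverse frequency: a / (a + p(w)), p(w) = unigram probability."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: a / (a + count / total) for w, count in counts.items()}


def embed_documents(docs, word_vectors, weights):
    """Weighted sum of word vectors per document; unknown words are skipped."""
    dim = len(next(iter(word_vectors.values())))
    out = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        for w in doc:
            if w in word_vectors and w in weights:
                out[i] += weights[w] * word_vectors[w]
    return out


def remove_common_component(embeddings):
    """Subtract each row's projection onto the first right singular vector."""
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    u = vt[0]
    return embeddings - np.outer(embeddings @ u, u)


# Toy usage with random stand-ins for word2vec/GloVe vectors (illustrative only).
docs = [
    ["the", "room", "was", "clean"],
    ["the", "staff", "was", "rude"],
    ["the", "pool", "closed"],
]
rng = np.random.default_rng(0)
vocab = sorted({w for doc in docs for w in doc})
word_vectors = {w: rng.normal(size=5) for w in vocab}

doc_vecs = embed_documents(docs, word_vectors, idf_weights(docs))
doc_vecs = remove_common_component(doc_vecs)
```

Swapping `idf_weights(docs)` for `sif_weights(docs)` changes only the weighting function, which is the kind of comparison the abstract describes.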

