CL · Feb 26, 2019

Improving a tf-idf weighted document vector embedding

arXiv:1902.09875v1 · 32 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses document representation for tasks like review analysis, but it is incremental as it builds on existing methods like weighted sums and common component removal.

The paper tackles the problem of computing dense vector embeddings for documents from word vectors. It finds that inverse document frequency (idf) weighting and common component removal improve performance, with idf the best-performing weighting in the authors' applications.

We examine a number of methods to compute a dense vector embedding for a document in a corpus, given a set of word vectors such as those from word2vec or GloVe. We describe two methods that can improve upon a simple weighted sum and that are optimal in the sense that they maximize a particular weighted cosine similarity measure. We consider several weighting functions, including inverse document frequency (idf), smooth inverse frequency (SIF), and the sub-sampling function used in word2vec. We find that idf works best for our applications. We also use common component removal proposed by Arora et al. as a post-process and find it is helpful in most cases. We compare these embedding variations to the doc2vec embedding on a new evaluation task using TripAdvisor reviews, and also on the CQADupStack benchmark from the literature.
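As a rough illustration of the baseline the abstract refers to, the sketch below builds an idf-weighted sum of word vectors for each document and then applies common component removal as a post-process. It is not the paper's implementation and does not cover its two optimal methods. The idf form log(N / df), the function names, and the random toy word vectors are all assumptions made for the example. Common component removal is taken here to mean subtracting each embedding's projection onto the first right singular vector of the embedding matrix.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(docs):
    """Assumed idf form: log(N / df(w)) over a tokenized corpus."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in df.items()}


def embed_documents(docs, word_vectors, weights):
    """One row per document: the weighted sum of its word vectors."""
    dim = len(next(iter(word_vectors.values())))
    out = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        for word in doc:
            # Words without a vector or a weight are skipped.
            if word in word_vectors and word in weights:
                out[i] += weights[word] * np.asarray(word_vectors[word], dtype=float)
    return out


def remove_common_component(embeddings):
    """Subtract each row's projection onto the first right singular vector."""
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    u = vt[0]  # unit-length direction shared most strongly across rows
    return embeddings - np.outer(embeddings @ u, u)


# Toy corpus and placeholder word vectors (hypothetical; a real setup would
# load word2vec or GloVe vectors instead).
docs = [
    ["great", "hotel", "room"],
    ["hotel", "room", "dirty"],
    ["great", "food"],
]
rng = np.random.default_rng(0)
vocab = sorted({word for doc in docs for word in doc})
word_vectors = {word: rng.normal(size=4) for word in vocab}

weights = idf_weights(docs)
emb = embed_documents(docs, word_vectors, weights)
cleaned = remove_common_component(emb)
```

Swapping in a different weighting, such as SIF or the word2vec sub-sampling function, would only require replacing the `weights` dictionary; the summation and post-processing steps stay the same.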

