CL · LG · Jul 8, 2017

Efficient Vector Representation for Documents through Corruption

arXiv:1707.02377v1 · 121 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient and high-quality document representations for NLP tasks like sentiment analysis and classification, offering a simple yet effective method that is incremental over existing approaches.

The authors address efficient document representation learning with Doc2VecC, a framework that represents each document as an average of word embeddings and uses a corruption model for regularization. The resulting word embeddings significantly outperform Word2Vec, and the document representations match or exceed state-of-the-art methods on sentiment analysis, document classification, and semantic relatedness tasks. Training runs at billions of words per hour on a single machine.

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
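The averaging-plus-corruption idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the corruption scheme, so the version below assumes a dropout-style corruption: each word is dropped with probability `q`, and surviving words are rescaled by `1 / (1 - q)` so that the corrupted average equals the plain average in expectation. The function names, the toy vocabulary, and the embedding values are all hypothetical, and no training step is shown.

```python
import random


def average_embedding(tokens, embeddings, dim):
    """Represent a document as the plain average of its word embeddings."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings[tok]
        for i in range(dim):
            total[i] += vec[i]
    n = len(tokens)
    return [x / n for x in total] if n else total


def corrupted_average(tokens, embeddings, dim, q, rng):
    """Average after randomly dropping each token with probability q.

    Assumed corruption scheme (not stated in the abstract): survivors are
    scaled by 1 / (1 - q), which keeps the expected value equal to the
    uncorrupted average.
    """
    total = [0.0] * dim
    scale = 1.0 / (1.0 - q)
    for tok in tokens:
        if rng.random() < q:
            continue  # this token is corrupted away
        vec = embeddings[tok]
        for i in range(dim):
            total[i] += scale * vec[i]
    n = len(tokens)
    return [x / n for x in total] if n else total


# Hand-made 2-d embeddings (hypothetical values). The common word "the"
# sits at zero, mimicking the effect the regularization is said to favor.
toy_embeddings = {"great": [1.0, 0.0], "movie": [0.0, 1.0], "the": [0.0, 0.0]}
doc = ["the", "movie", "great", "the"]

doc_vec = average_embedding(doc, toy_embeddings, dim=2)
noisy_vec = corrupted_average(doc, toy_embeddings, 2, q=0.5, rng=random.Random(0))
```

Under this assumption, a representation for an unseen document at test time needs only `average_embedding`, which is one way to read the abstract's claim about efficient inference.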

Code Implementations: 1 repo