CL · Dec 20, 2016

SCDV : Sparse Composite Document Vectors using soft clustering over distributional representations

arXiv:1612.06778v3 · 24 citations
Originality: Incremental advance
AI Analysis

SCDV addresses the need for efficient and effective document representation in natural language processing, improving both task performance and computational cost. The advance is incremental because it builds on existing distributional representations.

The paper tackles the problem of representing documents for text analysis by introducing SCDV, a feature vector technique that clusters word embeddings to capture multiple semantic contexts, outperforming the previous state-of-the-art method NTSG on multi-class and multi-label classification tasks and achieving significant reductions in training and prediction times.

We present a feature vector formation technique for documents - Sparse Composite Document Vector (SCDV) - which overcomes several shortcomings of the current distributional paragraph vector representations that are widely used for text representation. In SCDV, word embeddings are clustered to capture multiple semantic contexts in which words occur. They are then chained together to form document topic-vectors that can express complex, multi-topic documents. Through extensive experiments on multi-class and multi-label classification tasks, we outperform the previous state-of-the-art method, NTSG (Liu et al., 2015a). We also show that SCDV embeddings perform well on heterogeneous tasks like Topic Coherence, context-sensitive Learning and Information Retrieval. Moreover, we achieve a significant reduction in training and prediction times compared to other representation methods. SCDV achieves the best of both worlds - better performance with lower time and space complexity.

Code Implementations: 4 repos
