CL, Mar 15, 2016

Topic Modeling Using Distributed Word Embeddings

arXiv:1603.04747v1, 2 citations
Originality: Incremental advance
AI Analysis

Vec2Topic offers a more effective topic modeling method for user-generated content, where topics are diffused across the corpus. The advance appears incremental, since it builds on existing word embedding techniques.

The authors propose Vec2Topic, an unsupervised algorithm that uses distributed word embeddings to identify and rank the topics in a corpus. They report that it outperforms Latent Dirichlet Allocation on user-generated content such as emails and chats, and that it also works well on non-user-generated content and on small corpora.

We propose a new algorithm for topic modeling, Vec2Topic, that identifies the main topics in a corpus using semantic information captured via high-dimensional distributed word embeddings. Our technique is unsupervised and generates a list of topics ranked with respect to importance. We find that it works better than existing topic modeling techniques such as Latent Dirichlet Allocation for identifying key topics in user-generated content, such as emails, chats, etc., where topics are diffused across the corpus. We also find that Vec2Topic works equally well for non-user-generated content, such as papers, reports, etc., and for small corpora such as a single document.
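
The abstract does not spell out the steps of Vec2Topic, so the following is only a minimal sketch of the general idea it describes: group words whose embedding vectors are close, then return the groups ranked by an importance score. It is not the authors' algorithm. The function name `rank_topics`, the greedy cosine-threshold grouping, the frequency-based importance score, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_topics(tokens, embeddings, threshold=0.8):
    """Hypothetical helper: group similar words, rank groups by frequency.

    This is an illustrative stand-in, not the Vec2Topic procedure.
    """
    counts = Counter(t for t in tokens if t in embeddings)
    clusters = []  # each cluster is a list of words; its first word is the seed

    # Visit frequent words first so each cluster is seeded by a prominent word.
    for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for cluster in clusters:
            if cosine(embeddings[word], embeddings[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])

    # Importance score (an assumption): total corpus frequency of the cluster.
    return sorted(clusters, key=lambda c: -sum(counts[w] for w in c))


# Toy 2-D vectors invented for this example; real embeddings are
# high-dimensional and learned from a corpus.
toy_embeddings = {
    "invoice": (1.0, 0.1), "payment": (0.9, 0.2), "refund": (0.95, 0.05),
    "meeting": (0.1, 1.0), "schedule": (0.2, 0.9),
}
toy_tokens = "invoice payment meeting invoice refund schedule payment invoice".split()

topics = rank_topics(toy_tokens, toy_embeddings)
for rank, words in enumerate(topics, start=1):
    print(rank, words)
```

On the toy data the billing-related words form the top-ranked group and the scheduling words the second. In practice the vectors would come from a trained embedding model, and both the clustering method and the scoring function would need to follow the paper itself.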

