CL IR LG · Jun 24, 2014

Scalable Topical Phrase Mining from Text Corpora

Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare Voss, Jiawei Han

arXiv:1406.6312v2 · 212 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for more interpretable and scalable topical phrase mining in domains such as research publications, reviews, and news articles. It is an incremental improvement over existing methods.

The paper tackles the problem of discovering topical phrases of mixed lengths from text corpora, proposing a novel approach that combines phrase mining and topic modeling to achieve high-quality phrase discovery with negligible extra computational cost compared to bag-of-words models.

While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post processing to the inference results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately-sized datasets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.
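As a rough illustration of what "segmenting a document into single and multi-word phrases" can look like, here is a minimal sketch. It is not the paper's algorithm: the greedy longest-match rule, the function names `count_ngrams` and `segment`, and the parameters `min_support` and `max_len` are all assumptions made for this example. It counts contiguous n-grams across a toy corpus and then splits each document into the longest n-grams that occur often enough.

```python
from collections import Counter


def count_ngrams(docs, max_len=3):
    """Count every contiguous n-gram (as a tuple) of up to max_len tokens."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def segment(tokens, counts, min_support=2, max_len=3):
    """Greedy left-to-right split (an assumed heuristic, not the paper's method).

    At each position, take the longest n-gram seen at least min_support
    times in the corpus; fall back to the single token otherwise.
    """
    phrases = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            gram = tuple(tokens[i:i + n])
            if n == 1 or counts[gram] >= min_support:
                phrases.append(" ".join(gram))
                i += n
                break
    return phrases


# Toy corpus, invented for illustration only.
docs = [
    "topic model for text mining".split(),
    "scalable topic model inference".split(),
    "text mining at scale".split(),
]
counts = count_ngrams(docs)
segmented = [segment(d, counts) for d in docs]
for phrases in segmented:
    print(phrases)
```

A topic model that "operates on the induced document partition" would then take each list of phrases, rather than the raw tokens, as its input, so that the words inside one phrase are treated as a unit.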
