LG CL IR · Sep 6, 2015

Sampled Weighted Min-Hashing for Large-Scale Topic Mining

Gibran Fuentes-Pineda, Ivan Vladimir Meza-Ruiz

arXiv:1509.01771v22.13 citations

Originality: Incremental advance

AI Analysis

This paper addresses the need for efficient topic mining in large text corpora, though the contribution appears incremental because it builds on existing hashing and topic-modeling techniques.

The paper tackles automatic topic mining from large-scale corpora by introducing Sampled Weighted Min-Hashing (SWMH), which generates topics as ordered subsets of the vocabulary based on term co-occurrence. It evaluates SWMH on corpora of up to 4 million documents and reports competitive performance on classification tasks.

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term co-occurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SWMH topics are ordered subsets of such vocabulary. Interestingly, the topics mined by SWMH underlie themes from the corpus at different levels of granularity. We extensively evaluate the meaningfulness of the mined topics both qualitatively and quantitatively on the NIPS (1.7 K documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora. Additionally, we compare the quality of SWMH with Online LDA topics for document representation in classification.
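The partition-then-agglomerate idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in plain (unweighted, unsampled) min-hashing over each term's set of document ids, and it merges cells greedily using an overlap coefficient. The function names, the `term_docs` input layout, and the parameters `num_partitions`, `tuple_size`, and `threshold` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def minhash_partition(term_docs, num_docs, tuple_size, rng):
    """Build one random partition of the vocabulary.

    Terms whose document sets yield the same tuple of min-hash values
    land in the same cell. Assumes every term occurs in at least one
    document. Plain min-hashing is used here, not the weighted, sampled
    variant described in the paper.
    """
    # Each min-hash function is modeled as a random permutation of doc ids.
    perms = []
    for _ in range(tuple_size):
        perm = list(range(num_docs))
        rng.shuffle(perm)
        perms.append(perm)

    cells = defaultdict(set)
    for term, docs in term_docs.items():
        key = tuple(min(perm[d] for d in docs) for perm in perms)
        cells[key].add(term)

    # Only cells holding at least two co-occurring terms are kept.
    return [cell for cell in cells.values() if len(cell) > 1]


def overlap(a, b):
    """Overlap coefficient between two non-empty term sets (an assumed choice)."""
    return len(a & b) / min(len(a), len(b))


def mine_topics(term_docs, num_docs, num_partitions=50, tuple_size=2,
                threshold=0.7, seed=0):
    """Greedily agglomerate highly overlapping cells across partitions."""
    rng = random.Random(seed)
    topics = []
    for _ in range(num_partitions):
        for cell in minhash_partition(term_docs, num_docs, tuple_size, rng):
            for topic in topics:
                if overlap(cell, topic) >= threshold:
                    topic |= cell  # merge the cell into an existing topic
                    break
            else:
                topics.append(set(cell))
    return topics
```

Here `term_docs` maps each term to the set of ids of the documents containing it, for example `{"neuron": {0, 1, 2}, "synapse": {0, 1, 2}}`. Terms that occur in similar sets of documents collide in the same cell more often, so repeating the partitioning and merging the overlapping cells tends to group co-occurring terms together.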
