LG CL DS IR, Nov 4, 2016

Generalized Topic Modeling

arXiv:1611.01259v12.74 citations

Originality Synthesis-oriented

AI Analysis

This work proposes a generalized topic-modeling framework that drops the i.i.d. word assumption of traditional topic models. It appears incremental: it builds on the existing multi-view (co-training) setting and does not claim broad state-of-the-art results.

The paper generalizes topic modeling beyond the standard i.i.d. word assumption, allowing each topic to be a complex distribution over sequences of paragraphs. Rather than learning these distributions explicitly, it aims to directly learn a document classifier that predicts a document's topic mixture. It presents conditions under which this can be done efficiently and discusses noise tolerance and sample complexity.

Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution $\vec a_i$ over words, and a document is generated by first selecting a mixture $\vec w$ over topics, and then generating words i.i.d. from the associated mixture $A{\vec w}$. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture. In this work we consider a broad generalization of this framework in which words are no longer assumed to be drawn i.i.d. and instead a topic is a complex distribution over sequences of paragraphs. Since one could not hope to even represent such a distribution in general (even if paragraphs are given using some natural feature representation), we aim instead to directly learn a document classifier. That is, we aim to learn a predictor that given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. We present several natural conditions under which one can do this efficiently and discuss issues such as noise tolerance and sample complexity in this model. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning.
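For concreteness, the standard generative process described in the abstract (select a mixture $\vec w$ over topics, then draw words i.i.d. from $A\vec w$) can be sketched as follows. This is a minimal illustration of the baseline model that the paper generalizes, not the paper's own method. The Dirichlet prior on $\vec w$, the toy vocabulary of five word ids, and the helper names `sample_dirichlet`, `mix_topics`, and `generate_document` are assumptions made for the example; the abstract does not specify how the mixture is selected.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a point on the simplex by normalizing Gamma draws.
    Assumption: a Dirichlet prior over topic mixtures (not stated in the abstract)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mix_topics(topics, w):
    """Compute the word distribution A w, where topics[i] is the topic vector a_i."""
    n_vocab = len(topics[0])
    return [sum(w[i] * topics[i][v] for i in range(len(topics)))
            for v in range(n_vocab)]

def generate_document(topics, n_words, rng):
    """Sample one document under the standard i.i.d.-word topic model."""
    w = sample_dirichlet([1.0] * len(topics), rng)   # mixture over topics
    word_dist = mix_topics(topics, w)                # A w
    # Words are drawn i.i.d. from A w; each word is an index into the vocabulary.
    words = rng.choices(range(len(word_dist)), weights=word_dist, k=n_words)
    return w, words

# Toy example: two hypothetical topics over a five-word vocabulary.
rng = random.Random(0)
topics = [
    [0.5, 0.3, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.3, 0.5],
]
w, words = generate_document(topics, 20, rng)
```

In the generalized setting the abstract describes, the i.i.d. draw in `generate_document` no longer applies, which is why the goal shifts from recovering the topic vectors to learning a predictor of $\vec w$ directly.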
