ML LGMay 17, 2018

A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

Xin Bing, Florentina Bunea, Marten Wegkamp

arXiv:1805.06837v37.325 citations

Originality Incremental advance

AI Analysis

This addresses the challenge of topic modeling in text analysis for researchers and practitioners, offering improved performance without prior knowledge of K, though it is an incremental advance over existing methods.

The authors tackled the problem of estimating topic models with an unknown number of topics by proposing a new method that estimates K from data and derived finite sample minimax bounds, showing it is faster and more accurate than existing methods in simulations.

We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates the number of topics K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.

View on arXiv PDF

Similar