Nonparametric Bayes Pachinko Allocation
This work aims to reduce manual tuning in topic modeling for researchers and practitioners. It is incremental in that it builds on the existing PAM and HDP methods.
The paper addresses how to learn both the number of topics and their correlation structure automatically in topic modeling. It proposes a nonparametric Bayesian prior for the pachinko allocation model (PAM) based on a variant of the hierarchical Dirichlet process (HDP), and shows that this approach matches the performance of the best manually tuned PAM on synthetic and real-world text datasets.
Recent advances in topic models have explored complicated structured distributions to represent topic correlation. For example, the pachinko allocation model (PAM) captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). While PAM provides more flexibility and greater expressive power than previous models like latent Dirichlet allocation (LDA), it also makes it more difficult to determine the appropriate topic structure for a specific dataset. In this paper, we propose a nonparametric Bayesian prior for PAM based on a variant of the hierarchical Dirichlet process (HDP). Although the HDP can capture topic correlations defined by a nested data structure, it does not automatically discover such correlations from unstructured data. By assuming an HDP-based prior for PAM, we are able to learn both the number of topics and how the topics are correlated. We evaluate our model on synthetic and real-world text datasets, and show that nonparametric PAM matches the performance of the best PAM configuration without manual tuning of the number of topics.
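To illustrate how a nonparametric prior can let the number of topics grow with the data rather than being fixed in advance, the following is a minimal sketch of the Chinese restaurant process, the standard sequential view of a single Dirichlet process. It is not the HDP variant proposed in the paper, whose details the abstract does not give. The function name `crp_assignments` and the parameters `n_items`, `alpha`, and `seed` are hypothetical names chosen for illustration.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster ("table") assignments from a Chinese restaurant process.

    Illustrative assumption: each table stands in for a topic, and alpha > 0
    is the concentration parameter of a single Dirichlet process.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently at table k
    assignments = []  # assignments[i] = table index chosen for item i
    for _ in range(n_items):
        # An existing table k is chosen with weight counts[k];
        # a new table is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    assignments, counts = crp_assignments(n_items=200, alpha=1.0, seed=1)
    print("number of tables:", len(counts))
    print("table sizes:", counts)
```

In this sketch the number of occupied tables is not specified up front; it is an outcome of the sampling, and larger values of `alpha` tend to produce more tables. An HDP-style prior shares such clusters across groups, which is the kind of mechanism the abstract relies on to avoid manually choosing the number of topics.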