Construction and Quality Evaluation of Heterogeneous Hierarchical Topic Models
This work addresses the challenge of building and evaluating hierarchical topic models for mixed data sources, representing an incremental improvement in topic modeling methodology.
The authors tackled the problem of constructing hierarchical topic models for heterogeneous data sources by representing a model as a set of flat layers connected by edges, introducing quality measures for those edges, and developing a heterogeneous algorithm that outperforms the baseline concat approach while preserving the original structure of existing models.
In our work, we propose to represent a hierarchical topic model (HTM) as a set of flat models, or layers, and a set of topical hierarchies, or edges. We suggest several quality measures for the edges of hierarchical models, resembling those proposed for flat models. We conduct an assessment experiment and show a strong correlation between the proposed measures and human judgement of topical edge quality. We also introduce a heterogeneous algorithm for building hierarchical topic models from heterogeneous data sources. We show how certain adjustments to the learning process help to retain the original structure of customized models while allowing slight, coherent modifications for new documents. We evaluate this approach using the proposed measures and show that the proposed heterogeneous algorithm significantly outperforms the baseline concat approach. Finally, we implement our own ESE, called Rysearch, which demonstrates the potential of the ARTM approach for visualizing large heterogeneous document collections.
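The layers-and-edges representation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the per-edge `weight` field, the 0.5 threshold, and the toy topics are all hypothetical, chosen only to show how flat models and the edges between them might be stored side by side.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """A flat topic model: each topic maps words to probabilities."""
    topics: list  # list of dicts, one dict per topic


@dataclass
class Edge:
    """A hypothetical link from a topic on one layer to a topic on the next."""
    level: int     # index of the parent layer
    parent: int    # topic index within the parent layer
    child: int     # topic index within the layer below
    weight: float  # illustrative edge strength (assumption, not from the paper)


@dataclass
class HierarchicalTopicModel:
    """Layers ordered from coarse to fine, plus the edges connecting them."""
    layers: list
    edges: list = field(default_factory=list)

    def children(self, level, parent, threshold=0.5):
        """Return child topic indices whose edge weight exceeds the threshold."""
        return sorted(
            e.child
            for e in self.edges
            if e.level == level and e.parent == parent and e.weight > threshold
        )


def toy_model():
    """Build a two-layer toy hierarchy with made-up topics and weights."""
    coarse = Layer([{"model": 0.6, "data": 0.4}, {"cell": 0.7, "gene": 0.3}])
    fine = Layer([{"model": 1.0}, {"data": 1.0}, {"cell": 0.5, "gene": 0.5}])
    edges = [
        Edge(0, 0, 0, 0.9),
        Edge(0, 0, 1, 0.8),
        Edge(0, 0, 2, 0.1),
        Edge(0, 1, 2, 0.95),
    ]
    return HierarchicalTopicModel([coarse, fine], edges)
```

Keeping edges separate from layers means each flat model can be inspected or scored on its own, while edge-level quality measures operate only on the `edges` list.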