CL IR LG Jul 10, 2020

Handling Collocations in Hierarchical Latent Tree Analysis for Topic Modeling

Leonard K. M. Poon, Nevin L. Zhang, Haoran Xie, Gary Cheng

arXiv:2007.05163v1

Originality: Incremental advance

AI Analysis

This work addresses a specific limitation of hierarchical topic modeling that is relevant to researchers and practitioners. The contribution is incremental because it builds on existing HLTA methods.

The authors address the problem of multiword expressions in hierarchical latent tree analysis (HLTA) for topic modeling. They propose a preprocessing method that extracts and selects collocations, which improved HLTA's performance on three of the four data sets tested.

Topic modeling has been one of the most active research areas in machine learning in recent years. Hierarchical latent tree analysis (HLTA) has been recently proposed for hierarchical topic modeling and has shown superior performance over state-of-the-art methods. However, the models used in HLTA have a tree structure and cannot represent the different meanings of multiword expressions sharing the same word appropriately. Therefore, we propose a method for extracting and selecting collocations as a preprocessing step for HLTA. The selected collocations are replaced with single tokens in the bag-of-words model before running HLTA. Our empirical evaluation shows that the proposed method led to better performance of HLTA on three of the four data sets tested.
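The preprocessing idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how collocations are scored or selected, so the count threshold and pointwise mutual information (PMI) criterion below are assumptions chosen for illustration, not the authors' method. The function names `select_collocations` and `merge_collocations`, and the underscore used to join tokens, are likewise hypothetical.

```python
import math
from collections import Counter


def select_collocations(docs, min_count=2, min_pmi=1.0):
    """Select bigrams that look like collocations.

    ASSUMPTION: the count threshold and PMI score are illustrative
    stand-ins; the source does not specify the selection criterion.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    selected = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if math.log(p_ab / (p_a * p_b)) >= min_pmi:
            selected.add((a, b))
    return selected


def merge_collocations(tokens, selected):
    """Replace each selected bigram with a single joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in selected:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both words of the collocation
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus: "new" appears both inside and outside a collocation.
docs = [
    ["new", "york", "is", "big"],
    ["new", "york", "times"],
    ["new", "ideas", "are", "big"],
]
selected = select_collocations(docs)
bag_of_words_input = [merge_collocations(d, selected) for d in docs]
```

In this toy corpus, "new york" becomes one token while the "new" in "new ideas" stays a separate word, so the two uses no longer share a single vocabulary entry in the bag-of-words input that would then be passed to HLTA.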
