CL | IR | LG | Jul 10, 2020

Handling Collocations in Hierarchical Latent Tree Analysis for Topic Modeling

arXiv:2007.05163v1
Originality: Incremental advance
AI Analysis

This work addresses a specific limitation of hierarchical topic modeling that is relevant to researchers and practitioners, but the contribution is incremental because it builds on existing HLTA methods.

The authors address the problem of multiword expressions in hierarchical latent tree analysis (HLTA) for topic modeling by proposing a preprocessing method that extracts and selects collocations. With this preprocessing, HLTA performed better on three of the four datasets tested.

Topic modeling has been one of the most active research areas in machine learning in recent years. Hierarchical latent tree analysis (HLTA) has been recently proposed for hierarchical topic modeling and has shown superior performance over state-of-the-art methods. However, the models used in HLTA have a tree structure and cannot represent the different meanings of multiword expressions sharing the same word appropriately. Therefore, we propose a method for extracting and selecting collocations as a preprocessing step for HLTA. The selected collocations are replaced with single tokens in the bag-of-words model before running HLTA. Our empirical evaluation shows that the proposed method led to better performance of HLTA on three of the four data sets tested.
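
The preprocessing idea in the abstract (find collocations, then replace each one with a single token before building the bag-of-words input) can be illustrated with a minimal sketch. The abstract does not state how collocations are scored or selected, so the pointwise mutual information (PMI) score, the `min_count` and `min_pmi` thresholds, the restriction to two-word collocations, and all function names below are assumptions made for illustration only, not the authors' method.

```python
import math
from collections import Counter


def find_collocations(docs, min_count=2, min_pmi=1.0):
    """Select candidate two-word collocations from tokenized documents.

    Assumption: bigrams are scored with PMI and kept when they are
    frequent enough and score above a threshold. The paper's actual
    extraction and selection criteria may differ.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    selected = set()
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        if math.log(p_ab / (p_a * p_b)) >= min_pmi:
            selected.add((a, b))
    return selected


def merge_collocations(tokens, collocations, sep="_"):
    """Replace each selected bigram with one joined token (left to right)."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip both words of the collocation
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bag_of_words(tokens):
    """Count token occurrences; merged collocations count as single tokens."""
    return Counter(tokens)
```

On a toy corpus where "neural network" and "social network" each occur twice, this sketch would turn the shared word "network" into two distinct tokens, `neural_network` and `social_network`, so a tree-structured model downstream no longer has to attach both senses to one variable. The resulting counts would then be the input to HLTA in place of the original bag-of-words representation.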

