CL · Aug 24, 2021

More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

arXiv:2108.10755v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses topic modeling for languages without marked word boundaries, such as Chinese and Thai, and offers an incremental improvement in how text is tokenized before it reaches the topic model.

The paper merges collocations into single tokens using Pearson's chi-squared test, t-statistics, and Word Pair Encoding, and shows that LDA models trained on merged tokens produce clearer and more coherent topics than models trained on unmerged tokens.

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries, such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The chi-squared, t, and WPE tokenizers are trained on Wikipedia text to look for words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.

