CL · Apr 7, 2023

InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling

Xiaobao Wu, Xinshuai Dong, Thong Nguyen, Chaoqun Liu, Liangming Pan, Anh Tuan Luu

arXiv:2304.03544v27.835 citations · h-index: 17 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses problems in cross-lingual text analysis that matter to both researchers and practitioners, but the advance is incremental because it builds on existing topic modeling methods.

The paper tackles two problems in cross-lingual topic modeling: repetitive topics and low-coverage dictionaries. It proposes InfoCTM, which aligns topics through mutual information and adds a vocabulary linking method. According to the authors, InfoCTM outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics that transfer better to cross-lingual classification tasks.

Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics. However, most existing methods suffer from two problems: repetitive topics that hinder further analysis, and performance decline caused by low-coverage dictionaries. In this paper, we propose the Cross-lingual Topic Modeling with Mutual Information (InfoCTM). Instead of the direct alignment in previous work, we propose a topic alignment with mutual information method. This works as a regularization to properly align topics and prevent degenerate topic representations of words, which mitigates the repetitive topic issue. To address the low-coverage dictionary issue, we further propose a cross-lingual vocabulary linking method that finds more linked cross-lingual words for topic alignment beyond the translations of a given dictionary. Extensive experiments on English, Chinese, and Japanese datasets demonstrate that our method outperforms state-of-the-art baselines, producing more coherent, diverse, and well-aligned topics and showing better transferability for cross-lingual classification tasks.
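The abstract does not spell out the alignment objective. As a rough illustration of how mutual information between the topic representations of linked words can act as a regularizer, the sketch below uses an InfoNCE-style contrastive loss, a common lower-bound surrogate for mutual information. Whether InfoCTM uses exactly this form is an assumption; the function name `info_nce_alignment_loss`, the `temperature` value, and the toy vectors are all hypothetical and not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_alignment_loss(reps_a, reps_b, links, temperature=0.2):
    """InfoNCE-style loss over linked cross-lingual word pairs (illustrative).

    reps_a[i] / reps_b[j]: topic representation of word i in language A
    and word j in language B (one weight per topic).
    links: (i, j) pairs treated as linked, e.g. dictionary translations.
    Simplification: every other word in language B counts as a negative.
    """
    total = 0.0
    for i, j in links:
        logits = [cosine(reps_a[i], rb) / temperature for rb in reps_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[j]
    return total / len(links)


# Toy data (hypothetical): two words per language, two topics.
reps_en = [[0.9, 0.1], [0.1, 0.9]]
links = [(0, 0), (1, 1)]

aligned = info_nce_alignment_loss(reps_en, [[0.8, 0.2], [0.2, 0.8]], links)
swapped = info_nce_alignment_loss(reps_en, [[0.2, 0.8], [0.8, 0.2]], links)
collapsed = info_nce_alignment_loss(reps_en, [[0.5, 0.5], [0.5, 0.5]], links)
```

In this toy setup the loss is lowest when linked words share a topic, higher when every word has the same topic representation (the degenerate case the abstract mentions), and highest when linked words sit in opposite topics. A loss of this shape therefore rewards alignment while penalizing collapse, which is the regularizing behavior the abstract describes.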
