CL | LG | Jul 24, 2023

Towards Generalising Neural Topical Representations

arXiv:2307.12564v4 | 1 citation | h-index: 24 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a practical limitation for users of topic models who need models to perform reliably across diverse datasets, though it is incremental as it builds on existing NTMs.

The paper tackles the weak generalization of neural topic models (NTMs) across different corpora. It proposes a plug-and-play module that creates a similar document for each training document through text data augmentation, then minimizes the Topical Optimal Transport (TopicalOT) distance between the topical representations of the two documents. The authors report significant improvements in generalization ability across corpora in extensive experiments.

Topic models have evolved from conventional Bayesian probabilistic models to recent Neural Topic Models (NTMs). Although NTMs have shown promising performance when trained and tested on a specific corpus, their generalisation ability across corpora has yet to be studied. In practice, we often expect that an NTM trained on a source corpus can still produce quality topical representations (i.e., latent distributions over topics) for documents from different target corpora, at least to a certain degree. In this work, we aim to improve NTMs further so that their representation power for documents generalises reliably across corpora and tasks. To do so, we propose to enhance NTMs by narrowing the semantic distance between similar documents, with the underlying assumption that documents from different corpora may share similar semantics. Specifically, we obtain a similar document for each training document by text data augmentation. Then, we optimise NTMs further by minimising the semantic distance between each pair, measured by the Topical Optimal Transport (TopicalOT) distance, which computes the optimal transport distance between their topical representations. Our framework can be readily applied to most NTMs as a plug-and-play module. Extensive experiments show that our framework significantly improves the generalisation ability regarding neural topical representation across corpora. Our code and datasets are available at: https://github.com/Xiaohao-Yang/Topic_Model_Generalisation.
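
To make the idea of an optimal transport distance between two topical representations concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not say how the transport cost between topics is built or how the distance is computed. The sketch assumes an entropy-regularised (Sinkhorn) solver and a toy cost matrix derived from hypothetical one-dimensional topic positions. The names `sinkhorn_distance`, `topic_pos`, `theta_doc`, `theta_aug`, and `theta_other` are illustrative only.

```python
import numpy as np


def sinkhorn_distance(a, b, cost, reg=0.1, n_iters=200):
    """Entropy-regularised OT cost between two topic distributions.

    a, b : 1-D arrays that each sum to 1 (distributions over K topics).
    cost : (K, K) matrix of transport costs between topics.
    This is an illustrative solver, not the paper's TopicalOT code.
    """
    kernel = np.exp(-cost / reg)      # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):          # alternate scaling to match both marginals
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]   # approximate transport plan
    return float(np.sum(plan * cost))


# Assumption: four topics placed on a line, so nearby topics are cheap to move between.
topic_pos = np.linspace(0.0, 1.0, 4)
cost = np.abs(topic_pos[:, None] - topic_pos[None, :])

theta_doc = np.array([0.7, 0.1, 0.1, 0.1])    # topical representation of a document
theta_aug = np.array([0.6, 0.2, 0.1, 0.1])    # its (hypothetical) augmented counterpart
theta_other = np.array([0.1, 0.1, 0.1, 0.7])  # an unrelated document

d_near = sinkhorn_distance(theta_doc, theta_aug, cost)
d_far = sinkhorn_distance(theta_doc, theta_other, cost)
print(f"doc vs augmented: {d_near:.4f}")
print(f"doc vs unrelated: {d_far:.4f}")
```

In a training setup along the lines the abstract describes, a differentiable version of such a distance between each document and its augmentation would be added to the NTM objective as an extra term to minimise. How that term is weighted is not stated in the abstract.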

Code Implementations: 1 repo
