TMT: A Simple Way to Translate Topic Models Using Dictionaries
The work addresses multilingual topic modeling for developers who lack knowledge of the target language or sufficient target-language data, though the contribution appears incremental because it builds on existing topic models such as LDA.
The paper tackles the challenge of training topic models for multilingual environments by introducing TMT, a technique that transfers topic models between languages without requiring aligned corpora or embeddings, and demonstrates it produces semantically coherent translations.
Training topic models for a multilingual environment is a challenging task that requires sophisticated algorithms, topic-aligned corpora, and manual evaluation. These difficulties are exacerbated when the developer lacks knowledge of the target language or works in a limited-data environment where only small or unusable multilingual corpora are available. To address these challenges, we introduce Topic Model Translation (TMT), a novel, robust, and transparent technique designed to transfer topic models (e.g., topic models based on Latent Dirichlet Allocation (LDA)) from one language to another without the need for metadata, embeddings, or aligned corpora. TMT enables the reuse of topic models across languages, making it especially suitable for scenarios where large corpora in the target language are unavailable or manual translation is infeasible. Furthermore, we evaluate TMT extensively using both quantitative and qualitative methods, demonstrating that it produces semantically coherent and consistent topic translations.
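To make the idea of dictionary-based topic transfer concrete, the following is a minimal illustrative sketch and not the TMT algorithm itself, whose details are not given in the abstract. It assumes a topic is a mapping from source-language words to probabilities and that a bilingual dictionary maps each source word to a list of candidate translations. The function name `translate_topic`, the even split of probability mass across translations, and the toy English-German data are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def translate_topic(topic, dictionary):
    """Map a source-language topic (word -> probability) into a target language.

    Assumption for this sketch: each source word's probability is split evenly
    across its dictionary translations. Words without a dictionary entry are
    dropped, and the remaining mass is renormalized to sum to 1.
    """
    translated = defaultdict(float)
    for word, prob in topic.items():
        targets = dictionary.get(word, [])
        if not targets:
            continue  # no translation available for this word
        share = prob / len(targets)
        for target in targets:
            translated[target] += share

    total = sum(translated.values())
    if total == 0:
        return {}
    return {word: prob / total for word, prob in translated.items()}


# Toy data (hypothetical): one English topic and a tiny English-German dictionary.
topic_en = {"bank": 0.5, "money": 0.3, "loan": 0.2}
en_de = {"bank": ["Bank", "Ufer"], "money": ["Geld"]}

topic_de = translate_topic(topic_en, en_de)
for word, prob in sorted(topic_de.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {prob:.4f}")
```

Splitting mass evenly is the simplest possible choice. It also exposes the main difficulty with dictionary-only transfer: the ambiguous word "bank" sends equal weight to "Bank" (financial institution) and "Ufer" (river bank), so a practical method needs some way to prefer the translation that fits the rest of the topic.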