CL | Apr 16, 2020

Cross-lingual Contextualized Topic Models with Zero-shot Learning

arXiv:2004.07737v2 | 816 citations
AI Analysis

This paper addresses the challenge of multilingual data analysis for NLP researchers and practitioners, though its contribution is incremental, since it builds on existing transfer learning approaches.

The paper tackles the problem of applying topic models to multilingual datasets whose documents share the same content but differ in language. It introduces a zero-shot cross-lingual topic model that learns topics from English documents and predicts them for unseen documents in other languages, and it shows that the transferred topics are coherent and stable across languages.

Many data sets (e.g., reviews, forums, news) exist in parallel in multiple languages. They all cover the same content, but the linguistic differences make it impossible to use traditional, bag-of-words-based topic models. Models have to be either single-language or suffer from a huge but extremely sparse vocabulary. Both issues can be addressed by transfer learning. In this paper, we introduce a zero-shot cross-lingual topic model. Our model learns topics in one language (here, English) and predicts them for unseen documents in different languages (here, Italian, French, German, and Portuguese). We evaluate the quality of the topic predictions for the same document in different languages. Our results show that the transferred topics are coherent and stable across languages, which suggests exciting future research directions.
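The zero-shot idea in the abstract can be illustrated with a small toy sketch. It assumes that each document is represented by an embedding from a multilingual encoder, so that a document and its translation land close together in the same vector space, and that a topic predictor trained on English embeddings can therefore be applied unchanged to other languages. The embeddings, the `topic_weights` matrix, and all function names below are hypothetical stand-ins chosen for illustration; they are not the authors' implementation or the API of any released library. The two agreement measures (top-topic match rate and KL divergence) are one plausible way to compare predictions for the same document across languages.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_topics(embedding, topic_weights):
    """Map a document embedding to a topic distribution.

    topic_weights holds one weight vector per topic. It is a hypothetical
    stand-in for a predictor trained on English documents only.
    """
    scores = [
        sum(w * x for w, x in zip(weights, embedding))
        for weights in topic_weights
    ]
    return softmax(scores)


def top_topic(dist):
    """Index of the most probable topic."""
    return max(range(len(dist)), key=dist.__getitem__)


def match_rate(dists_a, dists_b):
    """Fraction of document pairs whose most probable topic agrees."""
    hits = sum(top_topic(a) == top_topic(b) for a, b in zip(dists_a, dists_b))
    return hits / len(dists_a)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two topic distributions (eps avoids log(0))."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Assumed toy data: two topics, two-dimensional embeddings.
topic_weights = [[1.0, 0.0], [0.0, 1.0]]

# Made-up embeddings for three English documents and their Italian
# translations; a multilingual encoder is assumed to place them nearby.
english_docs = [[2.0, 0.1], [0.2, 1.5], [1.0, 0.9]]
italian_docs = [[1.8, 0.3], [0.1, 1.7], [0.8, 1.0]]

english_topics = [predict_topics(e, topic_weights) for e in english_docs]
italian_topics = [predict_topics(e, topic_weights) for e in italian_docs]

rate = match_rate(english_topics, italian_topics)
divergences = [
    kl_divergence(p, q) for p, q in zip(english_topics, italian_topics)
]

print("top-topic match rate:", rate)
print("mean KL divergence:", sum(divergences) / len(divergences))
```

Because the predictor only ever sees embeddings, it never needs a vocabulary for the target language, which is how this setup sidesteps the sparse multilingual vocabulary problem described above. In the toy data the third document pair sits near the boundary between the two topics, so its top topic differs between the two languages, which is the kind of disagreement such metrics are meant to surface.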

Code Implementations: 2 repos