CL · IR · LG · ML · May 9, 2012

Multilingual Topic Models for Unaligned Text

arXiv:1205.2657v1 · 188 citations
Originality: Highly original
AI Analysis

MuTo provides a framework for multilingual topic modeling that does not require curated parallel corpora, which lets topic model applications run on a wider class of corpora.

The authors tackled the problem of analyzing multilingual text corpora without requiring parallel alignment by developing MuTo, a probabilistic model that simultaneously discovers language matching and multilingual latent topics using stochastic EM. They demonstrated MuTo successfully finds shared topics and pairs related documents across languages on real-world corpora.

We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.
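As a rough illustration of what "pairing related documents across languages" could look like once shared topics are available, the sketch below matches each document in one language to the document in the other language with the closest topic proportions. It is not the MuTo inference procedure: it assumes per-document topic proportions have already been inferred, the Hellinger distance is an arbitrary choice made here, the names `hellinger` and `pair_documents` are hypothetical, and the numbers are made up.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def pair_documents(theta_src, theta_tgt):
    """For each source-language document, return the index of the
    target-language document whose topic proportions are closest."""
    pairs = []
    for p in theta_src:
        distances = [hellinger(p, q) for q in theta_tgt]
        pairs.append(min(range(len(theta_tgt)), key=distances.__getitem__))
    return pairs


# Made-up topic proportions over 3 shared topics (illustrative only).
theta_en = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
theta_de = [[0.2, 0.1, 0.7], [0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

pairs = pair_documents(theta_en, theta_de)
print(pairs)
```

Because both languages share one topic space in this setup, the comparison needs no dictionary or sentence alignment at lookup time; any other distance between distributions could be swapped in for `hellinger`.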

