ML CL IR | Dec 18, 2017

Multilingual Topic Models

Kriste Krstovski, Michael J. Kurtz, David A. Smith, Alberto Accomazzi

arXiv:1712.06704v1 | 1.02 citations

Originality: Incremental advance

AI Analysis

This work addresses indexing and retrieval challenges in information science, but it appears incremental as it builds on existing latent variable models for parallel document representations.

The authors tackled the problem of vocabulary mismatch in scientific publications by proposing a multilingual topic model that treats different document representations as translations from a common latent representation, enabling evaluation of topical similarity and identification of concept vocabulary improvements.

Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview on latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides means to discover where different concept vocabularies require improvement.
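The evaluation idea in the abstract, scoring each representation by how topically close it is to the original article, can be illustrated with a minimal sketch. It assumes that inference has already produced a topic distribution for the full text and for each representation, and it uses Jensen-Shannon divergence as the similarity measure. Both the measure and the numbers are illustrative assumptions, not taken from the paper, and the names `js_divergence` and `rank_representations` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions (0 = identical)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_representations(full_text_topics, representations):
    """Sort representations from most to least topically similar to the full text."""
    scores = {
        name: js_divergence(full_text_topics, theta)
        for name, theta in representations.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Made-up topic proportions over three shared topics (assumption, for illustration).
full_text = [0.70, 0.20, 0.10]
representations = {
    "title": [0.50, 0.30, 0.20],
    "abstract": [0.68, 0.22, 0.10],
    "keywords": [0.10, 0.10, 0.80],
}

ranking = rank_representations(full_text, representations)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

A representation with a low divergence tracks the article's topics closely. A high divergence, as with the made-up keyword vector here, would flag a representation or concept vocabulary that may need improvement.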
