ML, IR, LG | Sep 2, 2013

Scalable Probabilistic Entity-Topic Modeling

arXiv:1309.0337v1

Originality: Incremental advance

AI Analysis

This work addresses entity disambiguation for applications such as information retrieval. It is an incremental advance: it builds on existing LDA approaches and contributes mainly scalability improvements.

The paper tackles the challenge of training large-scale probabilistic entity-topic models for entity disambiguation. It develops a distributed inference framework built on a parallel Gibbs sampler and MapReduce pipelines, and reports state-of-the-art performance on a public dataset.

We present an LDA approach to entity disambiguation. Each topic is associated with a Wikipedia article and topics generate either content words or entity mentions. Training such models is challenging because of the topic and vocabulary size, both in the millions. We tackle these problems using a novel distributed inference and representation framework based on a parallel Gibbs sampler guided by the Wikipedia link graph, and pipelines of MapReduce allowing fast and memory-frugal processing of large datasets. We report state-of-the-art performance on a public dataset.
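To make the sampling step concrete, here is a minimal single-machine sketch of textbook collapsed Gibbs sampling for LDA, in which each topic index stands for one entity (for example, a Wikipedia article) and content words and mentions share a single integer vocabulary. It is not the paper's distributed sampler: it has no link-graph guidance, no parallelism, and no MapReduce stage. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy corpus are all assumptions made for illustration.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA (hypothetical helper).

    docs: list of documents, each a list of integer token ids in
    [0, vocab_size). Returns (assignments, doc_topic, topic_word).
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]            # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                          # n(k)
    assignments = []

    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalised conditional p(z = k | everything else).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return assignments, doc_topic, topic_word


if __name__ == "__main__":
    # Toy corpus: 3 documents over a 5-token vocabulary, 2 "entities".
    toy_docs = [[0, 1, 2, 0], [3, 4, 3], [0, 3, 1]]
    z, dt, tw = gibbs_lda(toy_docs, n_topics=2, vocab_size=5, n_iters=20)
    print(dt)
```

The inner loop evaluates every topic for every token, so its cost grows linearly with the number of topics. With topics and vocabulary both in the millions, as the abstract describes, that per-token cost and the size of the count tables are what make naive training impractical.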
