The Author-Topic Model for Authors and Documents
This work gives researchers and analysts a way to incorporate authorship into topic analysis of documents. The contribution is incremental, since it builds directly on LDA.
The authors model documents with authorship information by extending Latent Dirichlet Allocation to include author-topic distributions. They apply the model to 1,700 NIPS papers and 160,000 CiteSeer abstracts, and demonstrate applications such as computing similarity between authors and the entropy of author output.
We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.
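The generative process described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions: each word's author is drawn uniformly from the document's co-authors (the text above says only that the document is a mixture of the authors' distributions), and entropy is taken over an author's topic distribution in bits. The names `generate_document`, `author_entropy`, `theta`, and `phi` are hypothetical, chosen for illustration only.

```python
import numpy as np


def generate_document(authors, theta, phi, n_words, rng):
    """Sample one document from an author-topic style generative process.

    authors: list of author indices credited on this document
    theta:   (A, T) array; row a is author a's distribution over topics
    phi:     (T, V) array; row t is topic t's distribution over words
    """
    n_topics, n_vocab = theta.shape[1], phi.shape[1]
    words, word_authors, word_topics = [], [], []
    for _ in range(n_words):
        # Assumption: each word picks one co-author uniformly at random.
        a = int(rng.choice(authors))
        # The chosen author's topic distribution selects a topic.
        t = int(rng.choice(n_topics, p=theta[a]))
        # The chosen topic's word distribution selects the word.
        w = int(rng.choice(n_vocab, p=phi[t]))
        words.append(w)
        word_authors.append(a)
        word_topics.append(t)
    return words, word_authors, word_topics


def author_entropy(theta_a):
    """Shannon entropy (in bits) of one author's topic distribution."""
    p = theta_a[theta_a > 0]  # skip zero entries to avoid log(0)
    return float(-(p * np.log2(p)).sum())
```

Under this sketch, an author whose topic distribution is spread evenly over many topics has high entropy, while an author concentrated on one topic has entropy near zero. In practice `theta` and `phi` would be estimated from data (the abstract uses Gibbs sampling) rather than specified by hand.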