CL
Nov 15, 2016

SimDoc: Topic Sequence Alignment based Document Similarity Framework

arXiv:1611.04822v2
8 citations
Originality: Incremental advance
AI Analysis

This paper addresses document similarity for enterprise tasks such as text mining, but the advance is incremental because it builds on existing topic modeling and sequence alignment methods.

The paper tackles document similarity by modeling documents as topic sequences and using sequence alignment, showing that SimDoc outperforms bag-of-words techniques in tasks like document clustering.

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that the thematic flow of documents, which is often disregarded by bag-of-words techniques, is pivotal in estimating their similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. Then, we use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism to compute topic-topic similarity to fine-tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques, both in accurately computing document similarity and in practical applications such as document clustering.
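
The idea of comparing documents as topic sequences can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the alignment algorithm or the scoring scheme, so the sketch assumes a Smith-Waterman-style local alignment, an exact-match topic-topic similarity, a gap penalty of 0.5, and normalization by the smaller self-alignment score. The names `align_score`, `topic_sim`, and `similarity` are hypothetical, and documents are assumed to be already converted to lists of topic ids.

```python
from typing import Callable, Hashable, Sequence

Topic = Hashable


def topic_sim(t1: Topic, t2: Topic) -> float:
    """Placeholder topic-topic similarity (assumption): +1 for the same topic, -1 otherwise."""
    return 1.0 if t1 == t2 else -1.0


def align_score(a: Sequence[Topic], b: Sequence[Topic],
                sim: Callable[[Topic, Topic], float] = topic_sim,
                gap: float = 0.5) -> float:
    """Best local alignment score between two topic sequences (Smith-Waterman style)."""
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        curr = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0.0,                                   # start a new local alignment
                prev[j - 1] + sim(a[i - 1], b[j - 1]),  # align the two topics
                prev[j] - gap,                         # skip a topic in a
                curr[j - 1] - gap,                     # skip a topic in b
            )
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a: Sequence[Topic], b: Sequence[Topic],
               sim: Callable[[Topic, Topic], float] = topic_sim,
               gap: float = 0.5) -> float:
    """Alignment score divided by the smaller self-alignment score (assumed normalization)."""
    if not a or not b:
        return 0.0
    denom = min(align_score(a, a, sim, gap), align_score(b, b, sim, gap))
    return align_score(a, b, sim, gap) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    doc_a = [1, 2, 3]   # topic ids in reading order
    doc_b = [3, 2, 1]   # same topics, reversed thematic flow
    print(similarity(doc_a, doc_a))
    print(similarity(doc_a, doc_b))
```

In this sketch the two example documents contain exactly the same topics, so an order-insensitive bag-of-topics comparison would treat them as identical, while the alignment score drops because the thematic flow is reversed. Replacing `topic_sim` with a graded measure is where a learned topic-topic similarity would plug in.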

