IR

Oct 25, 2015

Comparative Document Analysis for Large Text Corpora

Xiang Ren, Yuanhua Lv, Kuansan Wang, Jiawei Han

arXiv:1510.07197v1

7.722 citations

Originality: Incremental advance

AI Analysis

This paper addresses the need for automated document comparison in fields like research and journalism, though it appears incremental because it builds on existing graph-based methods.

The paper tackles the problem of automatically identifying commonalities and differences between documents or document sets, which it calls Comparative Document Analysis. It uses a graph-based framework and joint optimization, and experiments on scientific and news corpora show that the approach is effective and robust.

This paper presents a novel research problem on joint discovery of commonalities and differences between two individual documents (or document sets), called Comparative Document Analysis (CDA). Given any pair of documents from a document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, and guides the selection of sets of phrases by solving two joint optimization problems. We develop an iterative algorithm to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance in a mutually enhancing way. Experiments on text corpora from two different domains (scientific publications and news) demonstrate the effectiveness and robustness of the proposed method on comparing individual documents. Our case study on comparing news articles published at different dates shows the power of the proposed method on comparing document sets.
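To make the task concrete, here is a toy sketch of the input and output shape of a CDA-style comparison. It is not the paper's method: the abstract does not give the graph-based measures or the iterative optimization, so the scoring rules below (minimum relevance for commonality, relevance difference for distinction), the function name `compare_documents`, and the hand-written relevance scores are all assumptions made purely for illustration.

```python
def compare_documents(rel_a, rel_b, k=2):
    """Toy comparison of two documents from phrase relevance scores in [0, 1].

    rel_a, rel_b: dicts mapping a phrase to its (assumed, precomputed)
    relevance to document A and document B respectively.
    """
    phrases = sorted(set(rel_a) | set(rel_b))

    # Assumed commonality score: high only if the phrase is relevant to both.
    common = {p: min(rel_a.get(p, 0.0), rel_b.get(p, 0.0)) for p in phrases}

    # Assumed distinction score: relevant to one document but not the other.
    dist_a = {p: rel_a.get(p, 0.0) - rel_b.get(p, 0.0) for p in phrases}
    dist_b = {p: -s for p, s in dist_a.items()}

    def top(scores):
        # Rank by descending score, break ties alphabetically,
        # and keep at most k phrases with a strictly positive score.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [p for p, s in ranked[:k] if s > 0]

    return top(common), top(dist_a), top(dist_b)


# Hypothetical relevance scores for two documents.
rel_a = {"graph partitioning": 0.9, "spectral clustering": 0.8}
rel_b = {"graph partitioning": 0.9, "topic model": 0.9, "news summarization": 0.6}

common, only_a, only_b = compare_documents(rel_a, rel_b)
print("common:", common)
print("distinct to A:", only_a)
print("distinct to B:", only_b)
```

In the paper's setting, the relevance scores are learned jointly with the phrase selection rather than supplied up front, which is the part this sketch leaves out.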
