IR · CL · Sep 5, 2017

Semantic Document Distance Measures and Unsupervised Document Revision Detection

arXiv:1709.01256v21088 citations
Originality: Incremental advance
AI Analysis

This paper addresses document revision detection in large-scale corpora. It is an incremental advance that applies new distance measures to a known bottleneck.

The paper tackles the problem of detecting document revisions by modeling it as a minimum cost branching problem and proposing two new distance measures, wDTW and wTED, achieving more precise detection than state-of-the-art methods on Wikipedia and simulated datasets.

In this paper, we model the document revision detection problem as a minimum cost branching problem that relies on computing document distances. Furthermore, we propose two new document distance measures, word vector-based Dynamic Time Warping (wDTW) and word vector-based Tree Edit Distance (wTED). Our revision detection system is designed for a large-scale corpus and implemented in Apache Spark. We demonstrate that our system can more precisely detect revisions than state-of-the-art methods by utilizing the Wikipedia revision dumps (https://snap.stanford.edu/data/wiki-meta.html) and simulated data sets.
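To make the idea behind a vector-based Dynamic Time Warping distance concrete, here is a minimal sketch of classic DTW over two sequences of vectors. It is an illustration under assumptions, not the paper's wDTW definition: the function names `euclidean` and `dtw_distance`, the use of Euclidean distance between elements, and the plain-list vector representation are all hypothetical choices made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed element distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic dynamic time warping cost between two sequences of vectors.

    Each sequence stands in for a document, e.g. one vector per paragraph.
    How the paper builds those vectors is not specified here, so the
    inputs are treated as given.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: advance in seq_a only, advance in
            # seq_b only, or advance in both (a one-to-one match).
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    original = [[0.0, 0.0], [1.0, 1.0]]
    revised = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]  # first element duplicated
    print(dtw_distance(original, revised))
```

Because DTW may align one element with several consecutive elements of the other sequence, a revision that only duplicates or splits a segment stays close to its source, which is the property that makes this family of distances a natural fit for revision detection.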

Code Implementations: 1 repo
