A new simple and effective measure for bag-of-word inter-document similarity measurement
This paper addresses a specific problem in text mining, of interest to both researchers and practitioners, and offers an incremental improvement over existing similarity measures.
The paper tackles the problem of measuring similarity between documents in bag-of-words representations by proposing a new measure that avoids explicit term weighting, showing it yields better results in binary representations and competitive, more consistent results in term-frequency-based ones.
To measure the similarity of two documents in the bag-of-words (BoW) vector representation, different term weighting schemes are used to improve the performance of cosine similarity, the most widely used inter-document similarity measure in text mining. In this paper, we identify the shortcomings of the assumptions underlying term weighting in the inter-document similarity measurement task and provide a more fit-for-purpose alternative. Based on this new assumption, we introduce a simple but effective similarity measure that does not require explicit term weighting. The proposed measure employs a more nuanced probabilistic approach than those used in term weighting to measure the similarity of two documents with respect to each term occurring in them. Our empirical comparison with existing similarity measures under different term weighting schemes shows that the new measure produces (i) better results in the binary BoW representation and (ii) competitive and more consistent results in the term-frequency-based BoW representation.
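For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of cosine similarity over BoW vectors under three representations: binary, raw term frequency, and TF-IDF. It does not implement the proposed measure, which the abstract does not specify. The toy documents, the whitespace tokenizer, the helper names, and the IDF variant log(N / df) are all assumptions made for illustration; other IDF formulations are in common use.

```python
import math
from collections import Counter


def bow(doc):
    """Term-frequency BoW vector as a term -> count mapping (whitespace tokens)."""
    return Counter(doc.lower().split())


def binary(tf):
    """Binary BoW: weight 1 for every term that occurs, regardless of count."""
    return {term: 1.0 for term in tf}


def idf_weights(tf_vectors):
    """Assumed IDF variant: log(N / df), where df is the document frequency."""
    n = len(tf_vectors)
    df = Counter(term for tf in tf_vectors for term in tf)
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf(tf, idf):
    """Scale each term frequency by the term's IDF weight."""
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, chosen only to make the arithmetic easy to follow.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
tf_vectors = [bow(d) for d in docs]
idf = idf_weights(tf_vectors)

sim_binary = cosine(binary(tf_vectors[0]), binary(tf_vectors[1]))
sim_tf = cosine(tf_vectors[0], tf_vectors[1])
sim_tfidf = cosine(tfidf(tf_vectors[0], idf), tfidf(tf_vectors[1], idf))
sim_unrelated = cosine(tf_vectors[0], tf_vectors[2])

print(f"binary={sim_binary:.3f} tf={sim_tf:.3f} tfidf={sim_tfidf:.3f}")
```

On this toy collection the first two documents score 0.6 in the binary representation, 0.75 with raw term frequencies, and roughly 0.29 with TF-IDF. The spread shows how strongly the result of cosine similarity depends on the chosen weighting scheme, which is the sensitivity the abstract sets out to avoid.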