IR · Oct 7, 2018

Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections

arXiv:1810.03099v13.21 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient and scalable document similarity detection in search engines to improve indexing and result quality, but it appears incremental as it builds on existing techniques.

The paper tackles the problem of detecting duplicated and near-duplicated web pages in search engines by proposing a new batch text similarity approach, which is shown to be faster and more accurate than cosine similarity and Simhash algorithms on the NEWS20 dataset.

The importance of an efficient and scalable document similarity detection system is undeniable. Search engines need batch text similarity measures to detect duplicated and near-duplicated web pages in their indexes, so that the same page is not indexed multiple times. Furthermore, in the scoring phase, search engines need similarity measures to detect duplicated content on web pages so as to increase the quality of their results. In this paper, a new approach to batch text similarity detection is proposed by combining ideas from dimensionality reduction techniques and information gain theory. The new approach focuses on the need of search engines to detect duplicated and near-duplicated web pages. It is evaluated on the NEWS20 dataset, and the results show that it outperforms the cosine text similarity algorithm in terms of speed and performance. In addition, it is faster and more accurate than the rival method, the Simhash similarity algorithm.
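For orientation, the two baselines named in the abstract can be sketched as follows. This is a minimal illustration of plain cosine similarity and Simhash, not the paper's multi-reference method, which the abstract does not specify. The whitespace tokenizer, the raw term-frequency weighting, the MD5 token hash, and the 64-bit fingerprint width are all assumptions chosen for brevity.

```python
import hashlib
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not the paper's preprocessing)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def simhash(text, bits=64):
    """Simhash fingerprint: each bit is the sign of a weighted vote over token hashes.

    `bits` must be a multiple of 8 and at most 128, because the token hash
    is taken from the leading bytes of an MD5 digest (an illustrative choice).
    """
    votes = [0] * bits
    for token, weight in Counter(tokenize(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(x, y):
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")
```

Pairwise cosine similarity compares every document against every other, which is the cost a batch method aims to reduce. Simhash reduces each document to a fixed-width fingerprint, so near-duplicates are found by a small Hamming distance between fingerprints rather than a full vector comparison.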
