IR · Dec 8, 2017

A Method for Finding Similar Documents Relying on Adding Repetition of Symbols in Length Based Filtering

Hossein Azgomi, Masumeh Ghasemi Mahsayeh, Masoud Mohammadi, Milad Moradi

arXiv:1712.03190v1 · 2 citations

Originality: Synthesis-oriented

AI Analysis

This is an incremental improvement for document similarity tasks in data mining.

The paper tackles the problem of finding similar documents in massive datasets by incorporating symbol repetition into length-based filtering, aiming to reduce comparisons and time.

A basic problem in mining massive datasets is finding similar items, for example finding similar documents. Many methods exist for this task, including shingling and length-based filtering. In the shingling method, substrings (called symbols) are selected from each document and placed in a set; similar documents are then found by calculating the similarity of their sets. In length-based filtering, only documents of close lengths are compared. Neither method considers the repetition of symbols. Taking repetition into account allows the length of a document to be calculated more accurately. In this paper we propose a method for finding similar documents that takes the repetition of symbols into account. This method separates documents more effectively. The main goal of this paper is to present a method for finding similar documents that requires fewer comparisons and less time.
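The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details are not given here; it is one plausible reading, assuming character k-shingles as the symbols, a multiset (bag) of shingles so that repeated symbols count toward document length, and the standard length bound for Jaccard similarity (a pair can only reach threshold t if the shorter length is at least t times the longer). The names shingle_bag, bag_jaccard, and similar_pairs, and the parameters k and threshold, are hypothetical.

```python
from collections import Counter


def shingle_bag(text, k=3):
    """Return a multiset of character k-shingles, keeping repeated shingles."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def bag_jaccard(a, b):
    """Multiset Jaccard: sum of minimum counts over sum of maximum counts."""
    inter = sum((a & b).values())  # Counter & keeps the minimum count per key
    union = sum((a | b).values())  # Counter | keeps the maximum count per key
    return inter / union if union else 1.0


def similar_pairs(docs, threshold=0.8, k=3):
    """Compare only pairs whose repetition-aware lengths are close enough.

    Returns (pairs at or above the threshold, number of comparisons made).
    This is an illustrative assumption, not the method from the paper.
    """
    bags = {name: shingle_bag(text, k) for name, text in docs.items()}
    # Length counts every occurrence of a shingle, not just distinct ones.
    sizes = {name: sum(bag.values()) for name, bag in bags.items()}
    order = sorted(bags, key=sizes.get)

    results, compared = [], 0
    for i, short in enumerate(order):
        for long in order[i + 1:]:
            # Bag Jaccard is at most sizes[short] / sizes[long], so once this
            # ratio drops below the threshold all longer documents fail too.
            if sizes[short] < threshold * sizes[long]:
                break
            compared += 1
            sim = bag_jaccard(bags[short], bags[long])
            if sim >= threshold:
                results.append((short, long, sim))
    return results, compared
```

For instance, "abababab" and "ababababab" both reduce to the same two distinct 3-shingles when plain sets are used, so a set-based length cannot tell them apart. Counting repetitions gives them lengths 6 and 8, which is the kind of finer length information the abstract refers to.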

