LSH methods for data deduplication in a Wikipedia artificial dataset
This work addresses data deduplication to improve model training efficiency by preventing skewed distributions, but it is incremental as it applies existing LSH methods to a new artificial dataset.
The paper tackled the problem of identifying and removing nearly redundant data in text datasets using locality sensitive hashing (LSH) models, achieving AUC scores over 0.9 with the best model reaching 0.96 on an artificial dataset from English Wikipedia.
This paper illustrates locality sensitive hashing (LSH) models for identifying and removing nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-under-curve (AUC) scores above 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data.
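As a rough illustration of the kind of pipeline described above, the following minimal sketch scores document pairs with MinHash signatures, one common LSH family, and summarizes the ranking quality with AUC. The abstract does not say which LSH models were evaluated, so MinHash is an assumption here, and all function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `auc`) and parameters (`k`, `num_perm`) are hypothetical choices made for the example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed (assumed scheme)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=128):
    """MinHash signature: the minimum hash value for each seeded hash function."""
    if not tokens:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def auc(scores, labels):
    """AUC as the probability that a duplicate pair outscores a non-duplicate pair.

    Ties count as half a win. Labels are 1 (duplicate) or 0 (not duplicate).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch, each candidate pair receives a similarity score from `estimated_jaccard`, and `auc` measures how well those scores separate known duplicate pairs from non-duplicate pairs. A full LSH index would additionally group signatures into bands so that only colliding documents are compared, which is omitted here for brevity.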