CL, IR, LG · Dec 10, 2021

LSH methods for data deduplication in a Wikipedia artificial dataset

arXiv:2112.11478v1
Originality: Synthesis-oriented
AI Analysis

This work addresses data deduplication to improve model training efficiency by preventing skewed distributions, but it is incremental as it applies existing LSH methods to a new artificial dataset.

The paper tackles the problem of identifying and removing nearly redundant data in text datasets using locality sensitive hashing (LSH) models. On an artificial dataset built from English Wikipedia articles, most models achieve AUC scores over 0.9, with the best model reaching 0.96.

This paper illustrates locality sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-under-curve (AUC) scores over 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data.
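As a rough illustration of the kind of pipeline the abstract describes, the sketch below uses MinHash with banding, one common LSH family for near-duplicate text. The abstract does not say which LSH variants the paper evaluates, so the choice of MinHash, the character 5-gram shingles, the 64 hash functions, the 16 bands, and every function name here are assumptions made for the example. The `auc` helper is a plain pairwise definition of area under the ROC curve, included only to show how similarity scores could be compared against duplicate labels.

```python
import hashlib
from itertools import combinations


def shingles(text, k=5):
    """Character k-grams of a lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=64):
    """MinHash signature: the minimum hash of the token set under each seed."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Banding step: documents sharing any full band become candidate duplicates."""
    rows = len(signatures[0]) // bands
    buckets = {}
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs


def auc(scores, labels):
    """Pairwise AUC: chance that a duplicate pair outscores a non-duplicate pair."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    docs = [
        "Locality sensitive hashing maps similar items to the same bucket.",
        "Locality sensitive hashing maps similar items to the same buckets.",
        "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    ]
    sigs = [minhash_signature(shingles(d)) for d in docs]
    print(candidate_pairs(sigs))
    print(estimated_jaccard(sigs[0], sigs[1]), estimated_jaccard(sigs[0], sigs[2]))
```

In a setup like this, the estimated Jaccard score for each document pair would serve as the classifier score, and `auc` would be computed against labels marking which pairs are known near-duplicates.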

