DB AI LG SE Jun 24, 2024

SimClone: Detecting Tabular Data Clones using Value Similarity

Xu Yang, Gopi Krishnan Rajbahadur, Dayi Lin, Shaowei Wang, Zhen Ming Jiang

arXiv:2407.12802v1 3.32 citations

Originality: Incremental advance

AI Analysis

This work addresses data management and licensing problems for AI developers working with tabular datasets. It is rated incremental because it builds on prior clone detection methods.

The paper tackles the detection of data clones in tabular datasets, which can cause management and licensing issues in AI software development. It proposes SimClone, a method that combines value similarity with a visualization component, and reports that it outperforms the state-of-the-art by at least 20% in both F1-score and AUC.

Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulties in managing data assets and data license violations when datasets containing clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information (e.g., font size, column header) to detect data clones, whereas tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets that does not rely on structural information. SimClone uses value similarities to detect data clones. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone's visualization component helps identify the exact location of the data clone in a dataset, with a Precision@10 value of 0.80 in the top 20 true positive predictions.
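
To make the idea of comparing datasets by their values rather than by their structure more concrete, here is a minimal sketch. It is not the SimClone algorithm, whose details are not given in the abstract. It assumes, purely for illustration, that each table is a header-less list of columns and that Jaccard overlap of cell values is an acceptable stand-in for "value similarity". The function names `jaccard`, `column_similarity_matrix`, and `most_similar_pair` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of cell values."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # two empty columns carry no evidence of a clone
    return len(set_a & set_b) / len(set_a | set_b)


def column_similarity_matrix(table_a, table_b):
    """Compare every column of table_a with every column of table_b.

    Each table is a list of columns; each column is a list of cell values.
    No headers or other structural information are used (illustrative assumption).
    """
    return [[jaccard(col_a, col_b) for col_b in table_b] for col_a in table_a]


def most_similar_pair(matrix):
    """Return ((row_index, col_index), score) for the highest-scoring column pair."""
    best = max(
        ((i, j) for i, row in enumerate(matrix) for j in range(len(row))),
        key=lambda ij: matrix[ij[0]][ij[1]],
    )
    return best, matrix[best[0]][best[1]]


# Toy data: column 0 of table_a shares three of its values with column 1 of table_b.
table_a = [[1, 2, 3, 4], ["x", "y", "z", "w"]]
table_b = [["p", "q", "r", "s"], [2, 3, 4, 5]]

matrix = column_similarity_matrix(table_a, table_b)
pair, score = most_similar_pair(matrix)
print(pair, score)
```

A matrix like this could also be rendered as a heat map to point a reviewer at the most suspicious column pairs, which is loosely analogous to the role the abstract assigns to the visualization component.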

