DB LG ML | Mar 16, 2018

Impacts of Dirty Data: an Experimental Evaluation

Zhixin Qi, Hongzhi Wang, Jianzhong Li, Hong Gao

arXiv:1803.06071v26.619 citations

Originality Synthesis-oriented

AI Analysis

It addresses the impact of data quality on results, a problem relevant to data mining and machine learning practitioners, but the contribution is incremental: it builds on known issues that have received little prior study.

This paper experimentally evaluates how missing, inconsistent, and conflicting data affect classification and clustering algorithms, providing guidelines for algorithm selection and data cleaning based on the findings.

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could guide both the selection of an appropriate algorithm for a given level of data quality and the decision of what share of the data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.

