DB · LG · ML · Mar 16, 2018

Impacts of Dirty Data: an Experimental Evaluation

arXiv:1803.06071v2 · 19 citations
Originality: Synthesis-oriented
AI Analysis

It addresses how data quality affects data mining and machine learning practice, but it is incremental: it builds on known issues, although little prior research has examined them.

This paper experimentally evaluates how missing, inconsistent, and conflicting data affect classification and clustering algorithms, providing guidelines for algorithm selection and data cleaning based on the findings.

Data quality issues have attracted widespread attention because dirty data degrades data mining and machine learning results. The relationship between data quality and result accuracy can guide both the selection of an appropriate algorithm for a given level of data quality and the decision on what share of the data to clean. However, little research has explored this relationship. Motivated by this, this paper experimentally compares the effects of missing, inconsistent, and conflicting data on classification and clustering algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.
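
To make the kind of experiment described in the abstract concrete, here is a minimal sketch of measuring how one dirty-data type (missing values) affects one classifier. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic two-class data, the mean imputation step, the nearest-centroid classifier, and all function names (`make_data`, `inject_missing`, `accuracy_at`) are hypothetical. The paper's actual datasets, algorithms, and metrics may differ.

```python
import random

def make_data(n, rng):
    """Two Gaussian classes in 2-D (synthetic, illustrative only)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 3.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def inject_missing(rows, rate, rng):
    """Replace each feature value with None with probability `rate`."""
    return [([None if rng.random() < rate else v for v in x], y) for x, y in rows]

def column_means(rows):
    """Per-feature mean over observed values (0.0 if a column is fully missing)."""
    means = []
    for j in range(len(rows[0][0])):
        seen = [x[j] for x, _ in rows if x[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return means

def impute(rows, means):
    """Fill each missing value with the mean of its column."""
    return [([means[j] if v is None else v for j, v in enumerate(x)], y) for x, y in rows]

def fit_centroids(rows):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in {y for _, y in rows}:
        pts = [x for x, y in rows if y == label]
        centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def accuracy_at(rate, seed=0):
    """Train on data with the given missing rate, evaluate on a clean test set."""
    rng = random.Random(seed)
    train = inject_missing(make_data(400, rng), rate, rng)
    test = make_data(200, rng)  # test set is kept clean
    model = fit_centroids(impute(train, column_means(train)))
    return sum(predict(model, x) == y for x, y in test) / len(test)

if __name__ == "__main__":
    for rate in (0.0, 0.2, 0.4, 0.6):
        print(f"missing rate {rate:.1f}: accuracy {accuracy_at(rate):.3f}")
```

Sweeping the missing rate and plotting accuracy against it gives one curve per algorithm. Comparing such curves is one plausible way to decide which algorithm tolerates a given level of dirtiness and how much of the data is worth cleaning.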

