LG · AI · Jun 28, 2024

A Survey on Data Quality Dimensions and Tools for Machine Learning

arXiv:2406.19614v1 · 26 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the challenge of data quality for machine learning practitioners, but its contribution is incremental because it synthesizes existing tools and trends.

This survey reviews 17 data quality evaluation and improvement tools from the last 5 years, comparing their strengths and limitations to propose a roadmap for developing open-source tools in machine learning.

Machine learning (ML) technologies have become integral to practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex datasets of data-centric AI, traditional methods such as exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools from the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Building on a discussion of the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.

Code Implementations: 1 repo
