LGAIMLFeb 18, 2025

Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework

arXiv:2502.13198v15 citationsh-index: 33Heliyon
Originality Incremental advance
AI Analysis

This addresses data quality issues for ML practitioners, but it is incremental as it combines existing methods like quality measurements and unsupervised learning in a new framework.

The paper tackles the problem of poor data quality limiting machine learning performance by proposing an unsupervised data-centric framework that identifies high-quality data, tested on anti-sense oligonucleotide datasets to guide efficient experiments and improve ML system performance.

Poor data quality limits the advantageous power of Machine Learning (ML) and weakens high-performing ML software systems. Nowadays, data are more prone to the risk of poor quality due to their increasing volume and complexity. Therefore, tedious and time-consuming work goes into data preparation and improvement before moving further in the ML pipeline. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines the curation of quality measurements and unsupervised learning to distinguish high- and low-quality data. The framework is designed to integrate flexible and general-purpose methods so that it is deployed in various domains and applications. To validate the outcomes of the designed framework, we implemented it in a real-world use case from the field of analytical chemistry, where it is tested on three datasets of anti-sense oligonucleotides. A domain expert is consulted to identify the relevant quality measurements and evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data that guide the conduct of efficient laboratory experiments and consequently improve the performance of the ML system.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes