LGDec 17, 2021

Quality of Data in Machine Learning

arXiv:2112.09400v116 citations
Originality Synthesis-oriented
AI Analysis

This refutes a common assumption in ML, suggesting data quality matters more than quantity, but it's incremental as it focuses on vocational student data.

The study empirically tested the assumption that more data improves machine learning performance, finding that increasing data records, sample frequency, or input features did not significantly boost model accuracies, though variance diminished for ensemble models.

A common assumption exists according to which machine learning models improve their performance when they have more data to learn from. In this study, the authors wished to clarify the dilemma by performing an empirical experiment utilizing novel vocational student data. The experiment compared different machine learning algorithms while varying the number of data and feature combinations available for training and testing the models. The experiment revealed that the increase of data records or their sample frequency does not immediately lead to significant increases in the model accuracies or performance, however the variance of accuracies does diminish in the case of ensemble models. Similar phenomenon was witnessed while increasing the number of input features for the models. The study refutes the starting assumption and continues to state that in this case the significance in data lies in the quality of the data instead of the quantity of the data.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes