Quality of Data in Machine Learning
This refutes a common assumption in ML, suggesting data quality matters more than quantity, but it's incremental as it focuses on vocational student data.
The study empirically tested the assumption that more data improves machine learning performance, finding that increasing data records, sample frequency, or input features did not significantly boost model accuracies, though variance diminished for ensemble models.
A common assumption exists according to which machine learning models improve their performance when they have more data to learn from. In this study, the authors wished to clarify the dilemma by performing an empirical experiment utilizing novel vocational student data. The experiment compared different machine learning algorithms while varying the number of data and feature combinations available for training and testing the models. The experiment revealed that the increase of data records or their sample frequency does not immediately lead to significant increases in the model accuracies or performance, however the variance of accuracies does diminish in the case of ensemble models. Similar phenomenon was witnessed while increasing the number of input features for the models. The study refutes the starting assumption and continues to state that in this case the significance in data lies in the quality of the data instead of the quantity of the data.