LG ML

Jan 11, 2019

Impact of Data Pruning on Machine Learning Algorithm Performance

Arun Thundyill Saseendran, Lovish Setia, Viren Chhabria, Debrup Chakraborty, Aneek Barman Roy

arXiv:1901.10539v1

1.86 citations

Originality: Synthesis-oriented

AI Analysis

This incremental work addresses dataset optimization for machine learning practitioners.

The study investigated whether data pruning affects the relative performance of machine learning algorithms, finding that algorithms performing better on unpruned datasets also performed better on pruned datasets.

Dataset pruning is the process of removing sub-optimal tuples from a dataset to improve the learning of a machine learning model. In this paper, we compared the performance of different algorithms, first on an unpruned dataset and then on an iteratively pruned dataset. The goal was to understand whether the relative ordering of algorithms changes with pruning: if an algorithm (say A) performs better than another algorithm (say B) on the unpruned dataset, will algorithm B perform better on the pruned data, or vice versa? The dataset chosen for our analysis is a subset of IMDb [1], the largest movie ratings database publicly available on the internet. The learning objective of the model was to predict the categorical rating of a movie as one of five bins: poor, average, good, very good, or excellent. The results indicated that an algorithm that performed better on the unpruned dataset also performed better on the pruned dataset.
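The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The bin edges (equal-width cuts on a 0 to 10 rating scale) are an assumption, since the abstract does not state them, and the helper names `rating_to_bin`, `ranking`, and `ranking_preserved` are hypothetical. The sketch only shows how a numeric rating might be mapped to the five categories and how one could check whether the ordering of algorithms by score is the same before and after pruning.

```python
from __future__ import annotations

from bisect import bisect_right

# Assumed bin edges on a 0-10 rating scale; the abstract does not specify them.
BIN_EDGES = [2.0, 4.0, 6.0, 8.0]
BIN_LABELS = ["poor", "average", "good", "very good", "excellent"]


def rating_to_bin(rating: float) -> str:
    """Map a numeric rating to one of the five categorical bins."""
    if not 0.0 <= rating <= 10.0:
        raise ValueError(f"rating out of range: {rating}")
    # bisect_right returns how many edges are <= rating, i.e. the bin index.
    return BIN_LABELS[bisect_right(BIN_EDGES, rating)]


def ranking(scores: dict[str, float]) -> list[str]:
    """Return algorithm names ordered from best to worst score.

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(scores, key=lambda name: (-scores[name], name))


def ranking_preserved(unpruned: dict[str, float], pruned: dict[str, float]) -> bool:
    """True if the algorithms rank in the same order on both datasets."""
    return ranking(unpruned) == ranking(pruned)
```

Given one dictionary of scores per dataset (for example, accuracy per algorithm measured on the unpruned data and on the pruned data), `ranking_preserved` returns `True` exactly when the relative ordering is unchanged, which is the outcome the abstract reports.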
