LG ML, May 14, 2019

Task-Driven Data Verification via Gradient Descent

arXiv:1905.05843v11.0

Originality: Incremental advance

AI Analysis

This paper addresses training-data quality for practitioners in supervised machine learning. It offers a task-driven verification method that applies across tasks, though the contribution appears incremental because it builds on existing gradient-based techniques.

The paper tackles the detection of corrupted or mislabeled samples in training datasets. It introduces Corruption Detection via Gradient Descent (CDGD), an algorithm that optimizes per-sample inclusion variables via "gradient descent on gradient descent" so that the trained network performs well on a small clean validation set. The methods are compared quantitatively on synthetic and real-world datasets.

We introduce a novel algorithm for the detection of possible sample corruption such as mislabeled samples in a training dataset given a small clean validation set. We use a set of inclusion variables which determine whether or not any element of the noisy training set should be included in the training of a network. We compute these inclusion variables by optimizing the performance of the network on the clean validation set via "gradient descent on gradient descent" based learning. The inclusion variables as well as the network trained in such a way form the basis of our methods, which we call Corruption Detection via Gradient Descent (CDGD). This algorithm can be applied to any supervised machine learning task and is not limited to classification problems. We provide a quantitative comparison of these methods on synthetic and real world datasets.
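The idea of learning inclusion variables by "gradient descent on gradient descent" can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a one-parameter linear model y = a*x, a squared-error loss, a single unrolled inner gradient step, and inclusion variables clipped to [0, 1]. The function names, learning rates, and data are all hypothetical and chosen only for illustration.

```python
def per_sample_grads(a, xs, ys):
    """Gradient of the squared error (a*x - y)^2 with respect to a, per sample."""
    return [2.0 * (a * x - y) * x for x, y in zip(xs, ys)]


def cdgd_toy(train_x, train_y, val_x, val_y, steps=50, lr_in=0.01, lr_out=0.5):
    """Toy sketch: learn one inclusion variable per training sample.

    Assumption: the outer gradient is taken through ONE inner gradient
    step only; this is a simplification for illustration.
    """
    n = len(train_x)
    a = 0.0          # model parameter
    w = [0.5] * n    # inclusion variables, one per training sample

    for _ in range(steps):
        g = per_sample_grads(a, train_x, train_y)

        # Inner step: gradient descent on the inclusion-weighted training loss.
        a_new = a - lr_in * sum(wi * gi for wi, gi in zip(w, g)) / n

        # Gradient of the clean validation loss at the updated parameter.
        g_val = sum(per_sample_grads(a_new, val_x, val_y)) / len(val_x)

        # Outer step (chain rule through the inner step):
        # dL_val/dw_i = g_val * d(a_new)/dw_i = g_val * (-lr_in * g_i / n)
        grad_w = [-lr_in * g_val * gi / n for gi in g]
        w = [min(1.0, max(0.0, wi - lr_out * gw)) for wi, gw in zip(w, grad_w)]

        a = a_new

    return a, w


# Hypothetical data: the clean relation is y = 2x; samples 1 and 4 are corrupted.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.0, -4.0, 6.0, 8.0, -10.0, 12.0]
val_x = [1.0, 2.0, 3.0]
val_y = [2.0, 4.0, 6.0]

a, w = cdgd_toy(train_x, train_y, val_x, val_y)
print("learned slope:", a)
print("inclusion variables:", w)
```

In this sketch, a training sample whose gradient points in the same direction as the validation gradient has its inclusion variable pushed up, and a sample whose gradient opposes it is pushed down, so the corrupted samples end with lower inclusion values than the clean ones.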
