LG CV ML · Jun 19, 2019

Training on test data: Removing near duplicates in Fashion-MNIST

arXiv:1906.08255v1 · 3 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses a data quality issue for researchers using Fashion-MNIST, but the contribution is incremental because it builds on an existing dataset.

The paper tackles the problem of near-duplicate images shared between the training and testing sets of Fashion-MNIST, which artificially inflate model accuracy, and produces a new version of the dataset with these duplicates removed.

MNIST and Fashion-MNIST are extremely popular benchmarks in machine learning. Fashion-MNIST improves on MNIST by posing a harder problem, increasing the diversity of testing sets, and more accurately representing a modern computer vision task. To improve the data quality of Fashion-MNIST, this paper investigates near-duplicate images shared between the training and testing sets. Such near-duplicates artificially inflate the testing accuracy of machine learning models. The paper identifies near-duplicate images in Fashion-MNIST and proposes a version of the dataset with the near-duplicates removed.
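The abstract does not say how near-duplicates are detected, so the following is only a minimal sketch of one plausible approach, not the paper's method. It flags a test image when its nearest training image lies within a small root-mean-square pixel distance. The function name `find_near_duplicates`, the `threshold` value of 0.05, and the `batch_size` parameter are all hypothetical choices made for illustration, and the demo uses synthetic random images rather than Fashion-MNIST itself.

```python
import numpy as np


def find_near_duplicates(train, test, threshold=0.05, batch_size=1000):
    """Return (test_index, train_index, distance) for each test image whose
    nearest training image is within `threshold` RMS pixel distance.

    Assumption: images are uint8 arrays of identical shape, and `threshold`
    is an illustrative value, not one taken from the paper.
    """
    # Flatten images to vectors and scale pixel values to [0, 1].
    tr = train.reshape(len(train), -1).astype(np.float64) / 255.0
    te = test.reshape(len(test), -1).astype(np.float64) / 255.0
    tr_sq = (tr ** 2).sum(axis=1)

    pairs = []
    # Process the test set in batches so the distance matrix stays small.
    for start in range(0, len(te), batch_size):
        chunk = te[start:start + batch_size]
        # Squared Euclidean distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        d2 = (chunk ** 2).sum(axis=1)[:, None] + tr_sq[None, :] - 2.0 * chunk @ tr.T
        # Clamp tiny negative values caused by rounding, then convert to RMS.
        rms = np.sqrt(np.maximum(d2, 0.0) / tr.shape[1])
        nearest = rms.argmin(axis=1)
        dist = rms[np.arange(len(chunk)), nearest]
        for i in np.flatnonzero(dist <= threshold):
            pairs.append((start + int(i), int(nearest[i]), float(dist[i])))
    return pairs


# Synthetic demo: random 28x28 images with one planted near-duplicate.
rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(50, 28, 28), dtype=np.uint8)
test = rng.integers(0, 256, size=(10, 28, 28), dtype=np.uint8)
noise = rng.integers(-3, 4, size=(28, 28))
test[3] = np.clip(train[7].astype(np.int16) + noise, 0, 255).astype(np.uint8)

pairs = find_near_duplicates(train, test)
# Dropping the flagged images from the test split is one possible cleanup;
# which split to prune is a design choice this sketch does not settle.
clean_test = np.delete(test, [i for i, _, _ in pairs], axis=0)
print(pairs, clean_test.shape)
```

Raw pixel distance only catches images that are almost identical pixel for pixel. Detecting shifted, rescaled, or re-lit copies would need a more tolerant similarity measure, such as perceptual hashes or learned embeddings.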

