CV · May 19, 2022

Oracle-MNIST: a Dataset of Oracle Characters for Benchmarking Machine Learning Algorithms

arXiv:2205.09442v25.77 citations · h-index: 65 · Has Code

Originality Synthesis-oriented

AI Analysis

Oracle-MNIST gives machine learning researchers a new benchmark dataset, particularly for image classification, though the contribution is incremental because it builds on the MNIST format.

The authors introduced Oracle-MNIST, a dataset of 30,222 ancient character images with 10 categories, designed for benchmarking pattern classification and offering challenges like noise and distortion from aging and varied writing styles. It is more difficult than MNIST but maintains compatibility with existing classifiers.

We introduce the Oracle-MNIST dataset, comprising $28\times 28$ grayscale images of 30,222 ancient characters from 10 categories, for benchmarking pattern classification, with particular challenges from image noise and distortion. The training set consists of 27,222 images in total, and the test set contains 300 images per class. Oracle-MNIST shares the same data format as the original MNIST dataset, allowing for direct compatibility with all existing classifiers and systems, but it constitutes a more challenging classification task than MNIST. The images of ancient characters suffer from 1) extremely serious and unique noise caused by three thousand years of burial and aging and 2) dramatically varying writing styles among ancient Chinese writers, both of which make them realistic for machine learning research. The dataset is freely available at https://github.com/wm-bupt/oracle-mnist.
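
Because the abstract states that Oracle-MNIST uses the same data format as MNIST, a reader for the standard MNIST IDX layout should in principle apply. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the example file name in the closing comment is an assumption based on MNIST's naming convention, not something confirmed by the abstract. The magic numbers 2051 (images) and 2049 (labels) and the big-endian headers are those of the original MNIST files.

```python
import gzip
import struct


def parse_idx_images(raw: bytes):
    """Parse an MNIST-style IDX3 image buffer into (rows, cols, images)."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected image magic number: {magic}")
    size = rows * cols
    pixels = raw[16:16 + count * size]
    if len(pixels) != count * size:
        raise ValueError("image buffer is shorter than its header claims")
    # Each image is a bytes object of rows * cols grayscale values (0-255).
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return rows, cols, images


def parse_idx_labels(raw: bytes):
    """Parse an MNIST-style IDX1 label buffer into a list of ints."""
    # Header: two big-endian unsigned 32-bit integers.
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError(f"unexpected label magic number: {magic}")
    labels = list(raw[8:8 + count])
    if len(labels) != count:
        raise ValueError("label buffer is shorter than its header claims")
    return labels


def load_gz(path: str) -> bytes:
    """Read and decompress a gzip-compressed IDX file."""
    with gzip.open(path, "rb") as f:
        return f.read()


# Hypothetical usage; the file name is assumed, not taken from the abstract:
# rows, cols, images = parse_idx_images(load_gz("train-images-idx3-ubyte.gz"))
```

If the files do follow this layout, the parsed training set should contain 27,222 images of 28 by 28 pixels, and the test set 10 × 300 = 3,000 images, which together give the 30,222 images quoted in the abstract.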
