MNIST-MIX: A Multi-language Handwritten Digit Recognition Dataset
MNIST-MIX provides a new benchmark for researchers in computer vision and digit recognition to test models on more diverse and realistic data, though the contribution is incremental because it builds on the existing MNIST format.
The authors address the limited language diversity in handwritten digit recognition by creating MNIST-MIX, a multi-language dataset with digits from 10 languages. It is the largest dataset of its type and is more challenging because of its imbalanced classification.
In this letter, we contribute a multi-language handwritten digit recognition dataset named MNIST-MIX, which is the largest dataset of its type in terms of both languages and data samples. Because it uses the same data format as MNIST, MNIST-MIX can be seamlessly applied in existing studies on handwritten digit recognition. By introducing digits from 10 different languages, MNIST-MIX becomes a more challenging dataset, and its imbalanced classification requires a better design of models. We also present the results of applying a LeNet model that is pre-trained on MNIST as the baseline.
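
One common way to account for imbalanced classes is to weight each class inversely to its frequency during training. The sketch below illustrates that idea only; it is not the authors' method. The function names, the `language_index * 10 + digit` label scheme, and the toy label list are all assumptions made for illustration, and the actual label layout of MNIST-MIX may differ.

```python
from collections import Counter


def combined_label(language_index, digit, digits_per_language=10):
    """Map a (language, digit) pair to one class index.

    Hypothetical scheme for illustration; not taken from the dataset.
    """
    if not 0 <= digit < digits_per_language:
        raise ValueError("digit out of range")
    return language_index * digits_per_language + digit


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy example: class 0 appears three times, class 1 once.
toy_labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(toy_labels)
```

Weights of this form could be passed to a weighted loss function so that under-represented language and digit combinations contribute more to each update. Whether this helps on MNIST-MIX would need to be checked empirically.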