CV · Jun 3, 2024

Khayyam Offline Persian Handwriting Dataset

Pourya Jafarzadeh, Padideh Choobdar, Vahid Mohammadi Safarzadeh

arXiv:2406.01025v12.0

Originality: Synthesis-oriented

AI Analysis

This paper provides a large, unconstrained dataset for Persian handwriting recognition, addressing a shortage of resources for researchers in this area. The contribution is incremental, since it adds to existing datasets.

The authors address the lack of comprehensive datasets for Persian handwriting recognition by introducing the Khayyam dataset, which contains 44,000 words, 60,000 letters, and 6,000 digits collected from 400 native writers. They demonstrate its applicability by training machine learning algorithms on the data and reporting the results.

Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples, which are rare in the currently available datasets. The Khayyam dataset contains 44,000 words, 60,000 letters, and 6,000 digits. The collection forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digit, letter, and word data, and the results are reported. This dataset is available for research and academic use.
