CV AI | May 25, 2023

CENSUS-HWR: a large training dataset for offline handwriting recognition

Chetan Joshi, Lawry Sorenson, Ammon Wolfert, Mark Clement, Joseph Price, Kasey Buckles

arXiv:2305.16275v1

Originality: Synthesis-oriented

AI Analysis

CENSUS-HWR gives handwriting recognition researchers a large benchmark dataset and addresses the overfitting that existing small datasets tend to cause. The contribution is incremental, since it focuses on data collection rather than novel methods.

The authors address the lack of large training datasets for offline handwriting recognition by introducing CENSUS-HWR, a new dataset with 1,812,014 grayscale images and 1,865,134 handwritten texts drawn from a vocabulary of 10,711 English words, extracted from the 1930 and 1940 US censuses.

Progress in Automated Handwriting Recognition has been hampered by the lack of large training datasets. Nearly all research relies on a handful of small datasets that often cause models to overfit. We present CENSUS-HWR, a new dataset consisting of full English handwritten words in 1,812,014 grayscale images. The collection contains a total of 1,865,134 handwritten texts from a vocabulary of 10,711 English words. The dataset is intended to serve as a benchmark for deep learning handwriting models. It was extracted from the 1930 and 1940 US censuses, each taken by approximately 70,000 enumerators. The dataset and the trained model, with its weights, are freely available to download at https://censustree.org/data.html.
