CVLGMLNov 10, 2018

Handwriting Recognition of Historical Documents with few labeled data

arXiv:1811.07768v154 citations
Originality Incremental advance
AI Analysis

This work addresses the challenge of training handwriting recognition systems for historical documents where labeled data is scarce, which is an incremental improvement in a domain-specific context.

The paper tackles the problem of offline handwriting recognition for historical documents with limited labeled data by training a deep convolutional recurrent neural network on only 10% of manually labeled text-line data and using incremental training and data augmentation. The system achieved the second-best result in the ICDAR2017 competition on the READ dataset.

Historical documents present many challenges for offline handwriting recognition systems, among them, the segmentation and labeling steps. Carefully annotated textlines are needed to train an HTR system. In some scenarios, transcripts are only available at the paragraph level with no text-line information. In this work, we demonstrate how to train an HTR system with few labeled data. Specifically, we train a deep convolutional recurrent neural network (CRNN) system on only 10% of manually labeled text-line data from a dataset and propose an incremental training procedure that covers the rest of the data. Performance is further increased by augmenting the training set with specially crafted multiscale data. We also propose a model-based normalization scheme which considers the variability in the writing scale at the recognition phase. We apply this approach to the publicly available READ dataset. Our system achieved the second best result during the ICDAR2017 competition.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes