CV | Jan 27, 2021

LS-HDIB: A Large Scale Handwritten Document Image Binarization Dataset

Kaustubh Sadekar, Ashish Tiwari, Prajwal Singh, Shanmuganathan Raman

arXiv:2101.11674v3 | 3.73 citations

Originality: Synthesis-oriented

AI Analysis

This dataset addresses a data bottleneck for researchers in document analysis, though the contribution is incremental because it builds on existing datasets.

The authors tackled the problem of limited data for handwritten document image binarization by creating LS-HDIB, a large-scale dataset with over a million images, and showed that training deep learning models on it enhances performance on unseen images.

Handwritten document image binarization is challenging due to high variability in the written content and complex background attributes such as page style, paper quality, stains, shadow gradients, and non-uniform illumination. Traditional thresholding methods do not generalize well to such challenging real-world scenarios, whereas deep learning-based methods perform relatively well when given sufficient training data. However, existing datasets are limited in size and diversity. This work proposes LS-HDIB, a large-scale handwritten document image binarization dataset containing over a million document images that span numerous real-world scenarios. Additionally, we introduce a novel technique that combines adaptive thresholding and seamless cloning to create the dataset with accurate ground truths. Through extensive quantitative and qualitative evaluation of eight different deep learning-based models, we demonstrate that their performance improves when they are trained on the LS-HDIB dataset and tested on unseen images.
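The abstract names adaptive thresholding and seamless cloning as the two ingredients of the dataset pipeline but gives no further detail here. The sketch below is one possible illustration of the idea, not the authors' implementation: it extracts a text mask from a clean page with a simple mean-based adaptive threshold, then pastes the text pixels onto a different background so the mask can serve as ground truth. The function names `adaptive_mean_threshold` and `composite`, the parameters `block` and `c`, and the toy 5x5 images are all assumptions made for illustration. The paste step is a plain masked copy standing in for seamless cloning, which blends gradients rather than copying pixels.

```python
def adaptive_mean_threshold(gray, block=3, c=5):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel is marked as text (0) when it is darker than the mean of its
    block x block neighborhood minus c; otherwise it is background (255).
    `block` and `c` are illustrative defaults, not values from the paper.
    """
    h, w = len(gray), len(gray[0])
    r = block // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clip the neighborhood window at the image borders.
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            if gray[y][x] < local_mean - c:
                out[y][x] = 0
    return out


def composite(gray, mask, background):
    """Copy the text pixels selected by `mask` onto `background`.

    This is a plain masked copy, a simplified stand-in for seamless
    cloning (gradient-domain blending), used here only for illustration.
    """
    return [
        [gray[y][x] if mask[y][x] == 0 else background[y][x]
         for x in range(len(gray[0]))]
        for y in range(len(gray))
    ]


# Toy clean page: light paper (200) with one dark vertical stroke (30).
page = [[200, 200, 30, 200, 200] for _ in range(5)]
# Toy target background with a horizontal illumination gradient.
background = [[100 + 10 * x for x in range(5)] for _ in range(5)]

mask = adaptive_mean_threshold(page)            # ground-truth binarization
synthetic = composite(page, mask, background)   # degraded-looking input
```

Each `(synthetic, mask)` pair then acts as one training example: the network sees the composited image and is supervised with the mask. On real images, library routines such as OpenCV's `cv2.adaptiveThreshold` and `cv2.seamlessClone` would likely replace these loops; whether the authors used those exact calls is not stated in this summary.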
