CV · Mar 19, 2013

Handwritten and Printed Text Separation in Real Document

arXiv:1303.4614v1 · 25 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses document digitization and archival for administrative and historical applications, though it appears incremental, since it combines existing techniques.

The paper tackles the problem of separating handwritten and printed text in noisy real documents with graphics and annotations, reporting performance close to 90% even with a very small training set of complex administrative documents.

The aim of the paper is to separate handwritten and printed text in real documents that contain noise and graphics, including annotations. Pseudo-lines and pseudo-words, extracted with the run-length smoothing algorithm (RLSA), serve as the basic blocks for classification. A multi-class support vector machine (SVM) with a Gaussian kernel performs a first labelling of each pseudo-word, taking its local neighbourhood into account. The context is then propagated between neighbours so that possible labelling errors can be corrected. To address running-time complexity, we propose linear-complexity methods that use k-NN with a constraint. When a kd-tree is used, the running time is almost linearly proportional to the number of pseudo-words. The performance of our system is close to 90%, even with a very small learning dataset whose samples consist mainly of complex administrative documents.
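To make the RLSA step concrete, here is a minimal one-dimensional sketch of horizontal run-length smoothing on a single binary pixel row. It is an illustration under stated assumptions, not the authors' implementation: the function names `rlsa_horizontal` and `runs`, the threshold values, and the convention of filling only gaps bounded by ink on both sides are all hypothetical choices made for this example.

```python
def rlsa_horizontal(row, threshold):
    """Fill background gaps of length <= threshold lying between two ink pixels.

    row: list of 0/1 ints, where 1 is ink (foreground).
    Assumption: gaps touching the row border are left untouched.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, px in enumerate(row):
        if px == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def runs(row):
    """Return (start, end) pairs, end exclusive, for each run of ink pixels."""
    spans = []
    start = None
    for i, px in enumerate(row):
        if px == 1 and start is None:
            start = i
        elif px == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(row)))
    return spans


# Toy row: two close strokes, then a wider gap, then another stroke.
row = [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]

# A small threshold merges only nearby strokes (pseudo-word scale).
word_level = rlsa_horizontal(row, threshold=2)
# A larger threshold also bridges the wide gap (pseudo-line scale).
line_level = rlsa_horizontal(row, threshold=5)

print(runs(word_level))  # [(1, 5), (9, 11)]
print(runs(line_level))  # [(1, 11)]
```

The same smoothing applied with a small and a large threshold yields two granularities of block, which is one plausible way to read the abstract's pseudo-words and pseudo-lines. A full 2D version would apply this to every row of a binarized page and then take connected components.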

