CV LG Jan 28, 2022

Self-paced learning to improve text row detection in historical documents with missing labels

Mihaela Gaman, Lida Ghadamiyan, Radu Tudor Ionescu, Marius Popescu

arXiv:2201.12216v3 1.4

Originality Incremental advance

AI Analysis

This work addresses a specific bottleneck in optical character recognition for historical document analysis, but the advance is incremental because it builds on existing detection methods such as YOLOv4.

The paper tackles text row detection in historical documents with missing labels. It proposes a self-paced learning algorithm that sorts training examples by annotation completeness and iteratively adds pseudo-bounding boxes, resulting in average precision improvements of more than 12% and 39% on two data sets.

An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving the row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order with respect to the number of ground-truth bounding boxes, and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations at the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 by more than 12% on one data set and 39% on the other.
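
The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The names `make_batches`, `self_paced_train`, `train_fn`, and `predict_fn` are hypothetical, and the detector is passed in as two callables so that no specific YOLOv4 API is assumed. The abstract does not say which pages receive pseudo-bounding boxes at each iteration, so the sketch assumes every page is refreshed. It also assumes ground-truth boxes get a score of 1.0, so they survive non-maximum suppression over overlapping predictions.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]            # x1, y1, x2, y2
Scored = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, score


def iou(a, b) -> float:
    """Intersection over union of two boxes (only the first 4 fields are used)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Scored], iou_threshold: float = 0.5) -> List[Scored]:
    """Greedy non-maximum suppression: keep the highest-scoring box of each overlap group."""
    kept: List[Scored] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, other) <= iou_threshold for other in kept):
            kept.append(box)
    return kept


def make_batches(pages: Dict[str, List[Box]], k: int) -> List[List[str]]:
    """Sort page ids by ground-truth box count (descending) and split them into k batches."""
    ordered = sorted(pages, key=lambda p: len(pages[p]), reverse=True)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def self_paced_train(
    pages: Dict[str, List[Box]],
    k: int,
    train_fn: Callable[[Dict[str, List[Scored]]], object],  # hypothetical trainer
    predict_fn: Callable[[object, str], List[Scored]],      # hypothetical predictor
    iou_threshold: float = 0.5,
):
    """Train over k iterations, adding one batch per iteration and refreshing labels."""
    ground_truth = {p: [(*b, 1.0) for b in boxes] for p, boxes in pages.items()}
    annotations = dict(ground_truth)
    active: List[str] = []
    model = None
    for batch in make_batches(pages, k):
        active.extend(batch)
        model = train_fn({p: annotations[p] for p in active})
        # Assumption: pseudo-boxes are merged into every page, so later batches
        # enter training with their missing rows already filled in.
        for p in pages:
            annotations[p] = nms(ground_truth[p] + predict_fn(model, p), iou_threshold)
    return model, annotations
```

Passing the detector as callables keeps the sketch independent of any training framework. In practice `train_fn` would fine-tune a row detector on the given annotations, and `predict_fn` would return its detections for one page, likely filtered by a confidence threshold.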
