CV · LG · Apr 7

The Character Error Vector: Decomposable errors for page-level OCR evaluation

Jonathan Bourne, Mwiza Simbeye, Joseph Nockels

arXiv:2604.06160 · 51.2

Predicted impact: top 68% in CV · last 90 days
Originality: Incremental advance

AI Analysis

The work addresses a bottleneck in Document Understanding research by providing a practical metric for evaluating OCR when page parsing is imperfect. The advance is incremental, since it builds on existing error-rate concepts.

Character Error Rate (CER) becomes undefined under page-parsing errors, which limits page-level OCR evaluation. The paper introduces the Character Error Vector (CEV), a decomposable bag-of-characters evaluator that bridges parsing and OCR metrics. Thresholding on easily available values predicts the main error source with an F1 of 0.91.

The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making page-level OCR evaluation challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing, OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the Document Understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a character distribution method using the Jensen-Shannon Distance. We validate the CEV's performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers made of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support Document Understanding research.
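To make the bag-of-characters idea concrete, the sketch below compares the character distributions of a reference text and an OCR output using the Jensen-Shannon distance. It is an illustration only, not the authors' library or their exact CEV formulation: the function names `char_distribution` and `jensen_shannon_distance` are hypothetical, and the base-2 logarithm (which bounds the distance to [0, 1]) is an assumption. Because only character counts are compared, the score stays defined even when the reading order of the page is wrong.

```python
import math
from collections import Counter


def char_distribution(text):
    """Return the relative frequency of each character in `text`.

    Hypothetical helper for illustration; not part of the paper's library.
    """
    if not text:
        raise ValueError("text must be non-empty")
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base 2) between two character distributions.

    `p` and `q` map characters to probabilities. The result lies in [0, 1]:
    0 for identical distributions, 1 for distributions with disjoint support.
    """
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint support.
    m = {ch: 0.5 * (p.get(ch, 0.0) + q.get(ch, 0.0)) for ch in support}

    def kl(a):
        # Kullback-Leibler divergence KL(a || m); zero-probability terms add 0.
        return sum(pa * math.log2(pa / m[ch]) for ch, pa in a.items() if pa > 0)

    divergence = 0.5 * kl(p) + 0.5 * kl(q)
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Assumed example strings: same characters in a different order, then a
# recognition error that swaps "o" for "0".
reference = "hello world"
scrambled = "world hello"
misread = "hell0 w0rld"

d_order = jensen_shannon_distance(
    char_distribution(reference), char_distribution(scrambled)
)
d_ocr = jensen_shannon_distance(
    char_distribution(reference), char_distribution(misread)
)
```

In this sketch `d_order` is zero because reordering does not change character counts, while `d_ocr` is positive because the substituted characters shift the distribution. A position-free comparison of this kind cannot by itself say where on the page an error occurred, which is consistent with the abstract's note that character-level positioning is needed for optimal triage.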
