OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment
The dataset addresses the need for reliable OCR quality assessment in real-world applications, but the contribution is incremental: it is a new dataset rather than a new method.
The authors tackle OCR quality evaluation by creating OCR-Quality, a human-annotated dataset of 1,000 PDF pages with quality scores, which serves as a benchmark for training and assessing OCR verification systems.
We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at https://huggingface.co/datasets/Aslan-mingye/OCR-Quality .
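To illustrate how a 4-level scale like the one described above could be used to assess an OCR verification system, the following minimal sketch compares predicted quality scores against human annotations. The function names (`evaluate_verifier`, `score_distribution`), the choice of metrics, and the example scores are assumptions made for illustration; they are not taken from the dataset or its annotation guidelines.

```python
from collections import Counter

# Score labels as stated in the abstract (1 is best, 4 is worst).
SCORE_LABELS = {1: "Excellent", 2: "Good", 3: "Fair", 4: "Poor"}


def evaluate_verifier(human, predicted):
    """Compare predicted quality scores with human scores (hypothetical helper).

    Returns exact agreement, agreement within one level, and mean absolute error.
    """
    if len(human) != len(predicted):
        raise ValueError("human and predicted must have the same length")
    if not human:
        raise ValueError("at least one scored page is required")
    for score in list(human) + list(predicted):
        if score not in SCORE_LABELS:
            raise ValueError(f"score {score!r} is outside the 1-4 scale")

    n = len(human)
    diffs = [abs(h - p) for h, p in zip(human, predicted)]
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mae": sum(diffs) / n,
    }


def score_distribution(scores):
    """Count how many pages fall into each quality level."""
    counts = Counter(scores)
    return {SCORE_LABELS[level]: counts.get(level, 0) for level in sorted(SCORE_LABELS)}


if __name__ == "__main__":
    # Made-up scores for six pages, for illustration only (not real dataset values).
    human_scores = [1, 1, 2, 3, 4, 2]
    predicted_scores = [1, 2, 2, 4, 4, 4]
    print(evaluate_verifier(human_scores, predicted_scores))
    print(score_distribution(human_scores))
```

Because the scale is ordinal, reporting within-one agreement and mean absolute error alongside exact agreement distinguishes a verifier that confuses adjacent levels (Excellent vs. Good) from one that confuses Excellent with Poor.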