CV · Mar 10

The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions

arXiv:2603.09470v14.6 · h-index: 5 · Has Code

Predicted impact: top 89% in CV · last 90 days · Originality: Synthesis-oriented

AI Analysis

This work provides a valuable resource for philologists and establishes a benchmark for OCR on noisy polytonic Greek, though it is incremental in applying existing methods to a new dataset.

The researchers tackled the challenge of digitizing nineteenth-century polytonic Greek texts with complex bilingual layouts, achieving a character error rate of 1.05% and a word error rate of 4.69%, and producing a corpus of six million annotated tokens.

We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, substantially outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
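
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between reference and OCR output, divided by the reference length, at character and word level respectively. The abstract does not describe the authors' evaluation script, so the function names, the whitespace tokenization, and the NFC Unicode normalization step are assumptions made here for illustration only.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text):
    # Assumption: compare in NFC so that a polytonic letter with its
    # diacritics counts as one character where a precomposed form exists.
    return unicodedata.normalize("NFC", text)


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace tokens)."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    truth = "ἐν ἀρχῇ ἦν ὁ λόγος"
    ocr = "ἐν ἀρχη ἦν ὁ λόγος"  # diacritics lost on one letter
    print(f"CER = {cer(truth, ocr):.4f}, WER = {wer(truth, ocr):.4f}")
```

A single lost diacritic counts as one character error but also invalidates the whole word, which is why WER is typically several times higher than CER, as in the 1.05% versus 4.69% figures reported here.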
