CV, Dec 23, 2021

Digital Editions as Distant Supervision for Layout Analysis of Printed Books

arXiv:2112.12703v1
Originality: Synthesis-oriented
AI Analysis

This work addresses the need of archivists and scholars for efficient layout analysis when digitizing historical documents. It is incremental, since it applies existing methods to a new data source.

The paper trains layout analysis models for historical printed books, using semantic markup from digital editions as distant supervision. In experiments on the half-million pages of the Deutsches Textarchiv (DTA), the region-level evaluation methods derived from this markup correlate highly with pixel-level and word-level metrics.

Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents' semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
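As a rough illustration of how semantic markup can act as distant supervision, the sketch below walks a TEI-style document and records which region types occur on each page, using page-break milestones (`pb`) to delimit pages. It is not the authors' pipeline: the sample fragment, the `REGION_TAGS` label set, and the `regions_by_page` helper are assumptions made for this example, and a region that spans a page break is simply assigned to the page on which it starts.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature TEI fragment, invented for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <pb n="1"/>
    <head>Erstes Kapitel</head>
    <p>Body text on page one.<note place="foot">A footnote.</note></p>
    <pb n="2"/>
    <p>Body text on page two.</p>
    <figure/>
  </body></text>
</TEI>"""

# Assumed label set; a real edition may mark up many more region types.
REGION_TAGS = {"head", "p", "note", "figure"}


def regions_by_page(xml_string):
    """Return {page number: [region label, ...]} in document order."""
    root = ET.fromstring(xml_string)
    pages = defaultdict(list)
    current_page = None
    for elem in root.iter():  # iter() visits elements in document order
        tag = elem.tag.replace(TEI_NS, "")
        if tag == "pb":
            # A page-break milestone starts a new page.
            current_page = elem.get("n")
        elif tag in REGION_TAGS and current_page is not None:
            # Nested regions (e.g. a note inside a paragraph) are listed separately.
            pages[current_page].append(tag)
    return dict(pages)


if __name__ == "__main__":
    print(regions_by_page(SAMPLE))
```

For the sample fragment this yields `{'1': ['head', 'p', 'note'], '2': ['p', 'figure']}`. Labels of this kind carry no coordinates, so they would still need to be aligned with page images (for example through OCR output) before they could supervise a layout model.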
