CV CL DBJun 25, 2025

From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents

arXiv:2506.20326v11 citationsh-index: 1

Originality Incremental advance

AI Analysis

It addresses the problem of automated processing for historical documents with complex layouts, showing incremental improvements in model selection and bounding box representation.

This paper benchmarks five object detection architectures for document layout analysis on three historical datasets, finding that Co-DETR achieves state-of-the-art results on structured layouts (0.752 mAP) while YOLOv11x-OBB outperforms others on complex datasets (0.564 and 0.568 mAP).

Robust Document Layout Analysis (DLA) is critical for the automated processing and understanding of historical documents with complex page organizations. This paper benchmarks five state-of-the-art object detection architectures on three annotated datasets representing a spectrum of codicological complexity: The e-NDP, a corpus of Parisian medieval registers (1326-1504); CATMuS, a diverse multiclass dataset derived from various medieval and modern sources (ca.12th-17th centuries) and HORAE, a corpus of decorated books of hours (ca.13th-16th centuries). We evaluate two Transformer-based models (Co-DETR, Grounding DINO) against three YOLO variants (AABB, OBB, and YOLO-World). Our findings reveal significant performance variations dependent on model architecture, data set characteristics, and bounding box representation. In the e-NDP dataset, Co-DETR achieves state-of-the-art results (0.752 mAP@.50:.95), closely followed by YOLOv11X-OBB (0.721). Conversely, on the more complex CATMuS and HORAE datasets, the CNN-based YOLOv11x-OBB significantly outperforms all other models (0.564 and 0.568, respectively). This study unequivocally demonstrates that using Oriented Bounding Boxes (OBB) is not a minor refinement but a fundamental requirement for accurately modeling the non-Cartesian nature of historical manuscripts. We conclude that a key trade-off exists between the global context awareness of Transformers, ideal for structured layouts, and the superior generalization of CNN-OBB models for visually diverse and complex documents.

View on arXiv PDF

Similar