CV · Mar 16

The COTe score: A decomposable framework for evaluating Document Layout Analysis models

Jonathan Bourne, Mwiza Simbeye, Ishtar Govia

arXiv:2603.12718 · 33.71 citations · h-index: 2

Predicted impact top 83% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the need for more robust and comparable evaluation metrics in document layout analysis, benefiting researchers and practitioners in document processing, though it is incremental as it builds on existing evaluation frameworks.

The paper tackles the evaluation of Document Layout Analysis (DLA) models, which are traditionally scored with object detection metrics such as IoU or F1 that are ill-suited to 2D printed media. It introduces the COTe score, a decomposable metric that reduces the interpretation-performance gap by up to 76% relative to F1 and reveals distinct failure modes across models.

Document Layout Analysis (DLA) is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1 or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can make the metrics give a misleading or uninformative picture of model performance. To encourage more robust, comparable, and nuanced DLA, we introduce the Structural Semantic Unit (SSU), a relational labelling approach that shifts the focus from the physical to the semantic structure of the content, and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to F1. Notably, we find that COTe's granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU-labelled dataset and a Python library for applying COTe in DLA projects.
