CL, AI, CV, DL. Mar 25, 2025

TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

arXiv:2503.22714v1. 3 citations. h-index: 1. Has Code.
Originality: Synthesis-oriented
AI Analysis

This work provides a resource for joint HTR and NER research on historical texts, but its contribution is incremental because it builds on existing collections.

The authors introduce TRIDIS, an open-source corpus of medieval and early modern manuscripts that aggregates legacy collections. They provide a unified overview of the corpus and run baseline experiments with TrOCR and MiniCPM2.5 that compare random and outlier-based test partitions.

This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.
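Point (iii) of the abstract, an out-of-domain test split driven by outlier detection in an embedding space, can be illustrated with a minimal sketch. The abstract does not say which outlier detector or embedding model is used, so the version below is an assumption throughout: it scores each sample by its Euclidean distance to the centroid of all embeddings and puts the farthest samples in the test set. The function names (`outlier_split`, `random_split`), the `test_fraction` parameter, and the toy two-dimensional vectors are hypothetical and stand in for whatever joint embeddings the corpus actually uses.

```python
import math
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def outlier_split(embeddings, test_fraction=0.1):
    """Split sample ids so the test set holds the samples farthest from the centroid.

    `embeddings` maps a sample id to its embedding vector (assumed to be
    precomputed). Centroid distance is a stand-in outlier score, not
    necessarily the detector used for TRIDIS.
    """
    ids = sorted(embeddings)
    center = centroid([embeddings[i] for i in ids])
    scores = {i: math.dist(embeddings[i], center) for i in ids}
    n_test = max(1, round(len(ids) * test_fraction))
    ranked = sorted(ids, key=lambda i: scores[i], reverse=True)
    test_ids = set(ranked[:n_test])
    train_ids = [i for i in ids if i not in test_ids]
    return train_ids, sorted(test_ids)


def random_split(ids, test_fraction=0.1, seed=0):
    """Baseline split: shuffle the ids and take the first n_test as the test set."""
    rng = random.Random(seed)
    shuffled = sorted(ids)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# Toy data: four clustered samples and one obvious outlier ("e").
toy = {
    "a": [0.0, 0.0],
    "b": [0.1, 0.0],
    "c": [0.0, 0.1],
    "d": [0.1, 0.1],
    "e": [5.0, 5.0],
}
train_ids, test_ids = outlier_split(toy, test_fraction=0.2)
```

On the toy data the outlier-based split places `"e"` in the test set, whereas `random_split` may pick any sample. Comparing a model's error on the two kinds of test set is the sort of comparison the baseline experiments in point (iv) describe.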

