CL, LG · Jun 29, 2024

eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey

Krzysztof Nowak, Jędrzej Ziębura, Krzysztof Wróbel, Aleksander Smywiński-Pohl

arXiv:2407.00418v1

Originality: Synthesis-oriented

AI Analysis

This work addresses the processing of Medieval Latin texts for researchers in historical linguistics and digital humanities. Its contribution is incremental: it applies existing transformer methods to a new domain-specific corpus.

This study tackled the problem of automatic linguistic annotation for Medieval Latin texts by developing eFontes models for lemmatization, part-of-speech tagging, and morphological feature determination, achieving high accuracy rates of 92.60%, 83.29%, and 88.57%, respectively.

This study introduces the eFontes models for automatic linguistic annotation of Medieval Latin texts, focusing on lemmatization, part-of-speech tagging, and morphological feature determination. Using the Transformers library, these models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin. The research evaluates the models' performance, addressing challenges such as orthographic variations and the integration of Latinized vernacular terms. The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%. The findings underscore the importance of high-quality annotated corpora and propose future enhancements, including extending the models to Named Entity Recognition.
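The summary reports one accuracy figure per task but does not say how the figures are computed. As a minimal sketch, assuming token-level exact-match accuracy (the usual convention for lemmatization and tagging, though not confirmed here), the metric could be computed as follows. The `accuracy` helper and the toy tokens are hypothetical illustrations, not data or code from the eFontes project.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of tokens whose predicted label exactly matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical toy annotations, not taken from the eFontes corpus.
gold_lemmas = ["rex", "dico", "et", "ecclesia"]
pred_lemmas = ["rex", "dico", "et", "eclesia"]  # spelling variant left unnormalized
gold_upos = ["NOUN", "VERB", "CCONJ", "NOUN"]
pred_upos = ["NOUN", "VERB", "CCONJ", "NOUN"]

lemma_acc = accuracy(gold_lemmas, pred_lemmas)
upos_acc = accuracy(gold_upos, pred_upos)

print(f"lemma accuracy: {lemma_acc:.2%}")
print(f"UPOS accuracy:  {upos_acc:.2%}")
```

Under exact matching, a lemma that differs from the gold form only in spelling still counts as an error, which is one way orthographic variation can lower a lemmatization score.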
