CL · Oct 15, 2024

A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek

arXiv:2410.12055v1 · 1.91 citations · h-index: 1 · Has Code · Proceedings of the First Workshop on Natural Language Processing and Language Models for Digital Humanities

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for improved natural language processing tools for Ancient Greek, but it is incremental as it compares and fine-tunes existing models rather than introducing a new method.

The paper compares six models to identify a state-of-the-art morphosyntactic parser and lemmatizer for Ancient Greek. Trankit annotates syntax best and GreTa annotates lemmata best, and the results suggest that token embeddings alone are insufficient for high UAS and LAS scores.

This paper presents an experiment that compares six models to identify a state-of-the-art morphosyntactic parser and lemmatizer for Ancient Greek capable of annotating according to the Ancient Greek Dependency Treebank annotation scheme. A normalized version of the major collections of annotated texts was used to (i) train the baseline model Dithrax with randomly initialized character embeddings and (ii) fine-tune Trankit and four recent models pretrained on Ancient Greek texts, i.e., GreBERTa and PhilBERTa for morphosyntactic annotation and GreTa and PhilTa for lemmatization. A Bayesian analysis shows that Dithrax and Trankit annotate morphology practically equivalently, while syntax is best annotated by Trankit and lemmata by GreTa. The results of the experiment suggest that token embeddings are not sufficient to achieve high unlabeled and labeled attachment scores (UAS and LAS) unless they are coupled with a modeling strategy specifically designed to capture syntactic relationships. The dataset and best-performing models are made available online for reuse.
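For readers unfamiliar with the two syntactic metrics named in the abstract, the following is a minimal sketch of how UAS and LAS are conventionally computed for a single sentence. It is not the paper's evaluation code: the function name `attachment_scores`, the token representation, and the toy sentence with its dependency labels are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: each token is (head_index, dependency_label),
# where head_index 0 stands for the artificial root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence.

    UAS counts tokens whose predicted head matches the gold head.
    LAS counts tokens whose predicted head AND label both match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sentences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")

    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    head_and_label_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / n, head_and_label_hits / n


# Toy four-token sentence; the labels are illustrative only.
gold = [(2, "ATR"), (3, "SBJ"), (0, "PRED"), (3, "OBJ")]
pred = [(2, "ATR"), (3, "OBJ"), (0, "PRED"), (2, "OBJ")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 0.75, LAS = 0.50
```

In this toy case three of the four heads are correct, but one of those three carries the wrong label, so LAS is lower than UAS. LAS can never exceed UAS, since a token that counts toward LAS must already have the correct head.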
