CL, Aug 13, 2024

Latin Treebanks in Review: An Evaluation of Morphological Tagging Across Time

Princeton
arXiv:2408.06675v1, 26 citations, h-index: 10
Originality: Synthesis-oriented
AI Analysis

This work addresses morphological tagging for Latin texts across time periods and provides a standardized evaluation framework. The contribution is incremental, since it builds on existing treebanks and methods.

The authors reviewed existing Latin treebanks to assess their coverage and heterogeneity, then created standardized data splits for cross-time analysis of morphological tagging, finding that BERT-based taggers outperform others and are more robust to domain shifts.

Existing Latin treebanks draw from Latin's long written tradition, spanning 17 centuries and a variety of cultures. Recent efforts have begun to harmonize these treebanks' annotations to better train and evaluate morphological taggers. However, the heterogeneity of these treebanks must be carefully considered to build effective and reliable data. In this work, we review existing Latin treebanks to identify the texts they draw from, identify their overlap, and document their coverage across time and genre. We additionally design automated conversions of their morphological feature annotations into the conventions of standard Latin grammar. From this, we build new time-period data splits that draw from the existing treebanks, which we use to perform a broad cross-time analysis for POS and morphological feature tagging. We find that BERT-based taggers outperform existing taggers while also being more robust to cross-domain shifts.
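
The cross-time setup described in the abstract can be illustrated with a minimal sketch: group sentences by time period, train on one period, and evaluate on every other. Everything specific here is an assumption made for illustration, not the authors' method: the period names and year boundaries in `PERIODS`, the sentence format (a dict with `year` and `tokens`), and the most-frequent-tag baseline, which stands in for the BERT-based taggers the paper evaluates.

```python
from collections import Counter, defaultdict

# Hypothetical period boundaries as (name, start_year, end_year);
# the paper's actual splits may differ.
PERIODS = [
    ("classical", -100, 200),
    ("late", 200, 600),
    ("medieval", 600, 1400),
    ("neo", 1400, 2000),
]

def period_of(year):
    """Map a composition year to a period name."""
    for name, start, end in PERIODS:
        if start <= year < end:
            return name
    return "unknown"

def split_by_period(sentences):
    """Group sentences (dicts with 'year' and 'tokens') by period."""
    splits = defaultdict(list)
    for sent in sentences:
        splits[period_of(sent["year"])].append(sent)
    return dict(splits)

def train_baseline(sentences):
    """Most-frequent-tag-per-form baseline (a stand-in tagger)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for form, tag in sent["tokens"]:
            counts[form.lower()][tag] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def accuracy(model, sentences, default="X"):
    """Token-level accuracy; unseen forms receive the default tag."""
    total = correct = 0
    for sent in sentences:
        for form, tag in sent["tokens"]:
            total += 1
            correct += model.get(form.lower(), default) == tag
    return correct / total if total else 0.0

def cross_time_matrix(sentences):
    """Accuracy for every (train period, test period) pair."""
    splits = split_by_period(sentences)
    return {
        (train, test): accuracy(train_baseline(splits[train]), splits[test])
        for train in splits
        for test in splits
    }

# Toy data for illustration only.
toy = [
    {"year": -50, "tokens": [("Gallia", "PROPN"), ("est", "AUX")]},
    {"year": 1250, "tokens": [("est", "AUX"), ("ens", "NOUN")]},
]
matrix = cross_time_matrix(toy)
```

The off-diagonal entries of `matrix` are the quantities of interest in a cross-time analysis: they show how much accuracy a tagger loses when the test period differs from the training period.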

Code Implementations: 1 repo
