Dependency Annotation of Ottoman Turkish with Multilingual BERT
This work addresses the scarcity of linguistic resources for Ottoman Turkish and facilitates automated analysis of historical documents, though the contribution is incremental, since it applies existing methods to a new language.
The study tackled the challenge of creating the first dependency treebank for Ottoman Turkish by pseudo-annotating data with a multilingual BERT-based model, manually correcting the pseudo-annotations, and fine-tuning the model with the corrected annotations, which sped up and simplified the annotation process. The resulting treebank will be part of the Universal Dependencies project to enable automated analysis of Ottoman Turkish documents.
This study introduces an annotation methodology based on a pretrained large language model for the first dependency treebank in Ottoman Turkish. Our experimental results show that iterating over three steps, i) pseudo-annotating data using a multilingual BERT-based parsing model, ii) manually correcting the pseudo-annotations, and iii) fine-tuning the parsing model with the corrected annotations, speeds up and simplifies the challenging dependency annotation process. The resulting treebank, which will be part of the Universal Dependencies (UD) project, will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
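The iterative procedure of steps i) to iii) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Treebank`, `annotation_loop`, `parse`, `correct`, and `fine_tune` are hypothetical, and the parser, the human correction step, and the fine-tuning routine are passed in as plain callables. A dependency tree is simplified to a list of head indices (0 for the root). The sketch also assumes that fine-tuning uses all corrected sentences collected so far, a detail the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Sentence = List[str]  # a tokenized sentence
Heads = List[int]     # simplified tree: head index per token, 0 = root
Parser = Callable[[Sentence], Heads]


@dataclass
class Treebank:
    """Hypothetical container for manually corrected annotations."""
    sentences: List[Sentence] = field(default_factory=list)
    trees: List[Heads] = field(default_factory=list)


def annotation_loop(
    batches: Iterable[List[Sentence]],
    parse: Parser,
    correct: Callable[[Sentence, Heads], Heads],
    fine_tune: Callable[[Parser, Treebank], Parser],
) -> Treebank:
    """Run the pseudo-annotate / correct / fine-tune cycle once per batch."""
    treebank = Treebank()
    for batch in batches:
        # i) pseudo-annotate the raw sentences with the current parser
        pseudo = [parse(sentence) for sentence in batch]
        # ii) an annotator corrects each pseudo-annotation
        corrected = [correct(s, t) for s, t in zip(batch, pseudo)]
        treebank.sentences.extend(batch)
        treebank.trees.extend(corrected)
        # iii) fine-tune on the corrected data (assumption: all data so far)
        parse = fine_tune(parse, treebank)
    return treebank
```

In such a setup, the cost of step ii) would be expected to fall as the parser improves, because fewer attachments need fixing in each round. This is the mechanism by which the abstract's loop can speed up annotation.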