CL | Feb 17

Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

arXiv:2602.15753v1 | 2 citations | h-index: 6
Originality: Synthesis-oriented
AI Analysis

The paper addresses annotation challenges for low-resource historical languages and offers a credible aid for linguists, though its contribution is incremental: it applies existing LLMs to new data.

This paper tackles lemmatization and POS-tagging for four under-resourced historical languages (Ancient Greek, Classical Armenian, Old Georgian, Syriac) using LLMs such as GPT-4 variants and open-weight Mistral models in few-shot and zero-shot settings. It finds that, in few-shot settings, LLMs achieve competitive or superior performance compared to PIE, a task-specific RNN baseline, across most languages.

Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
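The few-shot setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the prompt wording, the tab-separated answer format, the UD-style tag names, and the helper names `build_few_shot_prompt`, `parse_response`, and `accuracy` are all assumptions made for illustration. The sketch builds a prompt from a few annotated sentences and scores a model's reply by token-level accuracy, with no actual model call.

```python
from typing import List, Tuple

# (token, lemma, POS tag); the tag names are an assumed UD-style inventory.
Annotation = Tuple[str, str, str]


def build_few_shot_prompt(examples: List[List[Annotation]], tokens: List[str]) -> str:
    """Build a prompt from annotated example sentences plus one sentence to annotate."""
    lines = [
        "Annotate each token with its lemma and part-of-speech tag.",
        "Answer with one line per token: token<TAB>lemma<TAB>POS.",
        "",
    ]
    for sentence in examples:
        lines.append("Sentence: " + " ".join(tok for tok, _, _ in sentence))
        lines.extend(f"{tok}\t{lemma}\t{pos}" for tok, lemma, pos in sentence)
        lines.append("")
    lines.append("Sentence: " + " ".join(tokens))
    return "\n".join(lines)


def parse_response(response: str, tokens: List[str]) -> List[Tuple[str, str]]:
    """Read (lemma, POS) per token; missing or malformed lines become ("", "")."""
    rows = [line.split("\t") for line in response.strip().splitlines()]
    parsed = []
    for i in range(len(tokens)):
        if i < len(rows) and len(rows[i]) == 3:
            parsed.append((rows[i][1], rows[i][2]))
        else:
            parsed.append(("", ""))
    return parsed


def accuracy(gold: List[Annotation], predicted: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (lemma accuracy, POS accuracy) over aligned tokens."""
    n = len(gold)
    lemma_hits = sum(g[1] == p[0] for g, p in zip(gold, predicted))
    pos_hits = sum(g[2] == p[1] for g, p in zip(gold, predicted))
    return lemma_hits / n, pos_hits / n


# Hypothetical Ancient Greek example; the "response" is hand-written, not model output.
gold = [("καί", "καί", "CCONJ"), ("ἔλεγον", "λέγω", "VERB")]
prompt = build_few_shot_prompt([gold], ["λόγος"])
response = "καί\tκαί\tCCONJ\nἔλεγον\tἔλεγον\tVERB"
predicted = parse_response(response, [tok for tok, _, _ in gold])
lemma_acc, pos_acc = accuracy(gold, predicted)
```

In this toy run the hand-written reply gets both tags right but copies the surface form of the verb instead of its lemma, which is the kind of error that lemma accuracy is meant to expose.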

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
