CL, Jun 11, 2018

Part-of-Speech Tagging on an Endangered Language: a Parallel Griko-Italian Resource

arXiv:1806.03757v11094 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses linguistic preservation for endangered languages such as Griko, though its contribution is incremental: it applies existing methods to a new dataset.

The paper tackles part-of-speech tagging for the endangered language Griko by creating a parallel Griko-Italian resource and evaluating several tagging methods. It finds that a semi-supervised approach combined with cross-lingual transfer achieves 72.9% accuracy, and that active learning improves results by more than 21 percentage points.

Most work on part-of-speech (POS) tagging is focused on high resource languages, or examines low-resource and active learning settings through simulated studies. We evaluate POS tagging techniques on an actual endangered language, Griko. We present a resource that contains 114 narratives in Griko, along with sentence-level translations in Italian, and provides gold annotations for the test set. Based on a previously collected small corpus, we investigate several traditional methods, as well as methods that take advantage of monolingual data or project cross-lingual POS tags. We show that the combination of a semi-supervised method with cross-lingual transfer is more appropriate for this extremely challenging setting, with the best tagger achieving an accuracy of 72.9%. With an applied active learning scheme, which we use to collect sentence-level annotations over the test set, we achieve improvements of more than 21 percentage points.
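The abstract mentions projecting cross-lingual POS tags. As a rough illustration of that general idea, and not the authors' implementation, the sketch below projects tags from a tagged source sentence (for example Italian) onto a target sentence (for example Griko) through word alignments. The function name `project_tags`, the `(source_index, target_index)` alignment format, and the fallback tag `X` are all assumptions made for this example.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_len, fallback="X"):
    """Project POS tags onto a target sentence via word alignments.

    source_tags: list of POS tags, one per source-sentence token.
    alignment:   iterable of (source_index, target_index) pairs
                 (hypothetical format, e.g. from a word aligner).
    target_len:  number of tokens in the target sentence.
    fallback:    tag given to target tokens with no aligned source token.
    """
    # One vote counter per target token.
    votes = [Counter() for _ in range(target_len)]
    for src_i, tgt_i in alignment:
        votes[tgt_i][source_tags[src_i]] += 1

    # Each target token takes the most frequent tag among its aligned
    # source tokens, or the fallback tag if it is unaligned.
    projected = []
    for counter in votes:
        if counter:
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(fallback)
    return projected


# Usage with placeholder data: three source tokens, four target tokens,
# and the last target token left unaligned.
example = project_tags(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 4)
```

In this sketch, unaligned target tokens receive the fallback tag. A practical system would typically treat such projected tags as noisy supervision for a trained tagger rather than as final output.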

Code Implementations: 1 repo