CL · LG · Nov 5, 2023

Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context

arXiv:2311.02777v1132 citations · h-index: 21
Originality: Synthesis-oriented
AI Analysis

This paper addresses generalization challenges faced by linguists documenting endangered languages with limited data, but its contribution is incremental because it applies existing methods to a specific domain.

The study examines how well morpheme labeling models generalize to unseen text genres in the endangered Mayan language Uspanteko, and reports a 2% improvement on out-of-distribution test data using strategies such as weight decay optimization and pseudo-labeling.

Generalization is of particular importance in resource-constrained settings, where the available training data may represent only a small fraction of the distribution of possible texts. We investigate the ability of morpheme labeling models to generalize by evaluating their performance on unseen genres of text, and we experiment with strategies for closing the gap between performance on in-distribution and out-of-distribution data. Specifically, we use weight decay optimization, output denoising, and iterative pseudo-labeling, and achieve a 2% improvement on a test set containing texts from unseen genres. All experiments are performed using texts written in the Mayan language Uspanteko.
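The abstract names iterative pseudo-labeling without describing the procedure, so the following is only a generic self-training sketch, not the authors' implementation. It assumes a toy count-based tagger as a stand-in for a real glossing model; the function names (`train`, `predict`, `pseudo_label`), the confidence threshold, the round limit, and the placeholder morphemes and tags are all hypothetical.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Fit a toy tagger: count gloss tags per morpheme (stand-in for a real model)."""
    counts = defaultdict(Counter)
    for morpheme, gloss in pairs:
        counts[morpheme][gloss] += 1
    return counts


def predict(model, morpheme):
    """Return (gloss, confidence); unseen morphemes get (None, 0.0)."""
    tags = model.get(morpheme)
    if not tags:
        return None, 0.0
    gloss, n = tags.most_common(1)[0]
    return gloss, n / sum(tags.values())


def pseudo_label(labeled, unlabeled, threshold=0.9, rounds=3):
    """Generic self-training loop (threshold and rounds are assumed values).

    Each round retrains on the current training set, labels the unlabeled
    pool, and moves predictions at or above the threshold into the
    training set. Stops early when no new item is confident enough.
    """
    train_set = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = train(train_set)
        confident, remaining = [], []
        for morpheme in pool:
            gloss, conf = predict(model, morpheme)
            if conf >= threshold:
                confident.append((morpheme, gloss))
            else:
                remaining.append(morpheme)
        if not confident:
            break
        train_set.extend(confident)
        pool = remaining
    return train(train_set), pool


# Placeholder data, not real Uspanteko morphemes or gloss labels.
labeled = [("ka", "TAG_A"), ("ka", "TAG_A"), ("ti", "TAG_B"), ("ti", "TAG_C")]
unlabeled = ["ka", "ti", "mo"]
model, leftover = pseudo_label(labeled, unlabeled)
```

In this toy run, the unambiguous morpheme is absorbed into the training set, while the ambiguous and the unseen morphemes stay in the leftover pool. A real system would replace `train` and `predict` with a neural tagger and would likely need a more careful confidence estimate.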

