CL · Feb 15, 2023

Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

arXiv:2302.07912v1268 citations · h-index: 28
Originality: Synthesis-oriented
AI Analysis

This work addresses automatic word alignment for low-resource languages, a capability that matters for building NLP tools for underserved linguistic communities. The contribution is incremental: it compares existing methods on new data.

The study evaluated modern transformer-based word alignment methods against traditional approaches on low-resource languages not included in the pretraining data. It found that transformer-based methods generally outperform traditional models, though the two classes of approach remain competitive with each other.

Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri--Spanish, Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.

Code Implementations: 1 repo
