CL · Feb 15, 2023

Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

Abteen Ebrahimi, Arya D. McCarthy, Arturo Oncevay, Luis Chiruzzo, John E. Ortega, Gustavo A. Giménez-Lugo, Rolando Coto-Solano, Katharina Kann

arXiv:2302.07912v1 · 28.2268 citations · h-index: 28 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of automatic word alignment for low-resource languages, which is crucial for improving NLP tools in underserved linguistic communities, though it is incremental as it compares existing methods on new data.

The study evaluates modern transformer-based word alignment methods against traditional approaches on low-resource languages absent from the models' pretraining data. It finds that transformer-based methods generally outperform traditional models, though the traditional models remain competitive.

Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.
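
To make the intrinsic evaluation concrete, here is a minimal sketch of how predicted word alignments might be scored against gold-standard alignments. It assumes alignments are stored in the common Pharaoh-style `i-j` text format and that precision, recall, and F1 over alignment links are the metrics of interest. The abstract does not state the paper's exact metric or data format, so both are assumptions, and the helper names `parse_links` and `score_alignment` are hypothetical.

```python
def parse_links(line):
    """Parse a Pharaoh-style line such as "0-0 1-2" into a set of
    (source_index, target_index) pairs. The format is an assumption."""
    return {tuple(int(i) for i in pair.split("-")) for pair in line.split()}


def score_alignment(predicted, gold):
    """Return (precision, recall, F1) of predicted links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy sentence pair (illustrative indices only, not data from the paper).
gold = parse_links("0-0 1-1 2-3 3-2")
pred = parse_links("0-0 1-1 2-2")
precision, recall, f1 = score_alignment(pred, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case two of the three predicted links appear in the gold set, so precision is 2/3 and recall is 2/4. A corpus-level score would typically pool link counts over all sentence pairs rather than average per-sentence values.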
