CL AI LG | Sep 15, 2023

AlbNER: A Corpus for Named Entity Recognition in Albanian

arXiv:2309.08741v1 | 0.91 citations | h-index: 1

Originality: Synthesis-oriented

AI Analysis

The paper provides a new resource for Albanian NLP, but its contribution is incremental: it applies existing methods to new data.

The paper tackles the lack of annotated text corpora for under-resourced languages by introducing AlbNER, a corpus of 900 sentences with labeled named entities from Albanian Wikipedia, and finds that language transfer significantly impacts NER performance while model size has a slight effect.

The scarcity of resources such as annotated text corpora for under-resourced languages like Albanian is a serious impediment to computational linguistics and natural language processing research. This paper presents AlbNER, a corpus of 900 sentences with labeled named entities, collected from Albanian Wikipedia articles. Preliminary results with BERT and RoBERTa variants fine-tuned and tested on AlbNER data indicate that model size has only a slight impact on NER performance, whereas language transfer has a significant one. The AlbNER corpus and the results obtained should serve as baselines for future experiments.
