CL · AI · LG · Sep 15, 2023

AlbNER: A Corpus for Named Entity Recognition in Albanian

arXiv:2309.08741v1 · 1 citation · h-index: 1
Originality: Synthesis-oriented
AI Analysis

AlbNER provides a resource for Albanian NLP, but the contribution is incremental because it applies existing methods to new data.

The paper addresses the shortage of annotated text corpora for under-resourced languages by introducing AlbNER, a corpus of 900 sentences with labeled named entities drawn from Albanian Wikipedia. Its preliminary experiments find that language transfer has a significant effect on NER performance, while model size has a slight one.

Scarcity of resources such as annotated text corpora for under-resourced languages like Albanian is a serious impediment in computational linguistics and natural language processing research. This paper presents AlbNER, a corpus of 900 sentences with labeled named entities, collected from Albanian Wikipedia articles. Preliminary results with BERT and RoBERTa variants fine-tuned and tested on AlbNER data indicate that model size has a slight impact on NER performance, whereas language transfer has a significant one. The AlbNER corpus and the results obtained should serve as baselines for future experiments.
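
To make "NER performance" concrete, the sketch below shows one common way such baselines are scored: exact-match, entity-level precision, recall and F1 over BIO-tagged sentences. This is an illustration under assumptions, not the paper's evaluation code. The abstract does not state the tagging scheme, the entity types, or the metric, so the BIO scheme, the `LOC` label, the helper names `bio_to_spans` and `entity_prf`, and the toy sentence are all hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical helper type
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((etype, start, i))
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:
            # "O" or a stray "I-" tag with no open entity: nothing is open.
            start, etype = None, None
    return spans


def entity_prf(gold: List[List[str]], pred: List[List[str]]):
    """Exact-match entity-level precision, recall and F1 over a corpus."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)  # a span counts only if type and boundaries match
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy sentence, invented for illustration (not taken from AlbNER):
# "Tirana është kryeqyteti i Shqipërisë" (Tirana is the capital of Albania)
gold = [["B-LOC", "O", "O", "O", "B-LOC"]]
pred = [["B-LOC", "O", "O", "O", "O"]]  # the model misses the second entity
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Scoring whole entities rather than individual tokens penalizes boundary errors, which is why entity-level F1 is a frequent choice when comparing fine-tuned models of different sizes or pretraining languages.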
