CL | AI | Apr 8, 2023

WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

arXiv:2304.04026v1261 citations | h-index: 35 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in NLP resources for Slovak language processing, though it is incremental as it adapts existing methods to a new dataset.

The authors tackled the lack of a high-quality annotated dataset for Slovak Named Entity Recognition by introducing WikiGoldSK, the first sizable human-labeled Slovak NER dataset, and showed that training on a silver-standard dataset yields better results in few-shot experiments.

Named Entity Recognition (NER) is a fundamental NLP task with a wide range of practical applications. The performance of state-of-the-art NER methods depends on high-quality manually annotated datasets, which still do not exist for some languages. In this work we aim to remedy this situation in Slovak by introducing WikiGoldSK, the first sizable human-labelled Slovak NER dataset. We benchmark it by evaluating state-of-the-art multilingual Pretrained Language Models and comparing it to the existing silver-standard Slovak NER dataset. We also conduct few-shot experiments and show that training on a silver-standard dataset yields better results. To enable future work that can be based on Slovak NER, we release the dataset, code, as well as the trained models publicly under permissible licensing terms at https://github.com/NaiveNeuron/WikiGoldSK.

Code Implementations: 1 repo
