A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to Kurdish-BLARK Named Entities
This work provides a resource for Kurdish NLP, which is an under-resourced language, but it is incremental as it amends an existing dataset.
The authors addressed the lack of named entity recognition (NER) resources for Kurdish (Sorani) by creating a dataset covering 11 categories with 33,261 entries, which is publicly available under a CC BY-NC-SA 4.0 license.
Named Entity Recognition (NER) is one of the essential applications of Natural Language Processing (NLP). It is also an instrument that plays a significant role in many other NLP applications, such as Machine Translation (MT), Information Retrieval (IR), and Part of Speech Tagging (POST). Kurdish is an under-resourced language from the NLP perspective. Particularly, in all the categories, the lack of NER resources hinders other aspects of Kurdish processing. In this work, we present a data set that covers several categories of NEs in Kurdish (Sorani). The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit). It covers 11 categories and 33261 entries in total. The dataset is publicly available for non-commercial use under CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/.