CL, Jan 12, 2023

A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to Kurdish-BLARK Named Entities

arXiv:2301.04962v1 | 2 citations | h-index: 12
Originality: Synthesis-oriented
AI Analysis

This work provides a resource for NLP in Kurdish, an under-resourced language, but it is incremental in that it amends an existing dataset.

The authors addressed the lack of named entity recognition (NER) resources for Kurdish (Sorani) by creating a dataset covering 11 categories with 33,261 entries, which is publicly available under a CC BY-NC-SA 4.0 license.

Named Entity Recognition (NER) is one of the essential applications of Natural Language Processing (NLP). It also plays a significant role in many other NLP applications, such as Machine Translation (MT), Information Retrieval (IR), and Part of Speech Tagging (POST). Kurdish is an under-resourced language from the NLP perspective. In particular, the lack of NER resources across all categories hinders other aspects of Kurdish language processing. In this work, we present a dataset that covers several categories of named entities (NEs) in Kurdish (Sorani). The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit). It covers 11 categories and 33,261 entries in total. The dataset is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/.
