CL, Jan 30, 2018

PEYMA: A Tagged Corpus for Persian Named Entities

Mahsa Sadat Shahshahani, Mahdi Mohseni, Azadeh Shakery, Heshaam Faili

arXiv:1801.09936v1, 1.0, 19 citations

Originality: Synthesis-oriented

AI Analysis

The PEYMA corpus provides a crucial dataset for Persian NLP tasks such as question answering, enabling research in a low-resource language.

The researchers addressed the lack of a standard Persian named entity recognition (NER) dataset by creating a free, large-scale tagged corpus from news texts.

The goal of the NER task is to classify the proper nouns in a text into classes such as person, location, and organization. This is an important preprocessing step in many NLP tasks, such as question answering and summarization. Although much research has been conducted in this area for English, and state-of-the-art NER systems have reached F1 scores above 90 percent, there are very few studies of this task for Persian. One of the main causes may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research, we create a standard, sufficiently large tagged Persian NER dataset, which will be distributed free of charge for research purposes. To construct such a dataset, we studied the standard NER datasets built for English research and found that almost all of them are constructed from news texts, so we collected documents from ten news websites. Then, to give annotators guidelines for tagging these documents, we studied the guidelines used to construct the standard English CoNLL and MUC datasets and set our own guidelines in light of Persian linguistic rules.
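To make the task and the F1 measure mentioned above concrete, here is a minimal sketch of entity-level scoring over token-level tags. It assumes IOB-style labels such as `B-PER`, `I-PER`, and `O`, a convention common in CoNLL-style datasets. The abstract does not specify the corpus's tag names or file format, so the labels, the function names `extract_spans` and `entity_f1`, and the toy tag sequences are all illustrative assumptions rather than the authors' scheme.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity type); hypothetical representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from one sentence of IOB-style tags (assumed format)."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            # Leniently treat a stray "I-" tag as the start of a new entity.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Entity-level F1: a prediction counts only if boundaries and type both match."""
    gold_set, pred_set = set(), set()
    for sent_id, (g, p) in enumerate(zip(gold, pred)):
        gold_set.update((sent_id, *s) for s in extract_spans(g))
        pred_set.update((sent_id, *s) for s in extract_spans(p))
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy tag sequences for a four-token sentence (illustrative only, not corpus data).
gold_tags = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_tags = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold_tags, pred_tags)
```

In this toy case the system finds one of the two gold entities and predicts nothing spurious, so precision is 1.0, recall is 0.5, and the F1 score is 2/3. Exact-match scoring of this kind is one common choice; other evaluation schemes give partial credit for overlapping boundaries.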
