CL · Oct 26, 2023

Arabic Fine-Grained Entity Recognition

arXiv:2310.17333v2133 citations · h-index: 13 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of fine-grained entity recognition resources for Arabic. The contribution is incremental: it builds on an existing corpus and applies known methods.

The paper tackles fine-grained entity recognition in Arabic by extending an existing corpus with 31 subtypes across four main entity types. It reports high inter-annotator agreement (Cohen's Kappa 0.9861, F1 0.9889) and baseline F1 scores of up to 0.920 with pre-trained BERT models.
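As a rough illustration of the agreement statistic cited above, the sketch below computes Cohen's Kappa for two annotators who label the same items. The function name `cohen_kappa` and the toy labels are assumptions made for this example; the summary does not describe how the authors implemented their agreement computation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each annotator labelled independently with their own label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example: the annotators disagree on one of four mentions.
annotator_1 = ["GPE", "GPE", "ORG", "ORG"]
annotator_2 = ["GPE", "GPE", "ORG", "GPE"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.5, so `kappa` is 0.5. Values near 1, such as the 0.9861 reported here, indicate agreement far above chance.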

Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood were manually annotated with the LDC's ACE subtypes. We refer to this extended version of Wojood as WojoodFine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute the baselines of WojoodFine, we fine-tuned three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER, and nested NER with subtypes, achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.
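To make the difference between the coarse and subtype settings concrete, the sketch below scores predicted mentions against gold mentions at the span level, once with subtypes and once with the subtype dropped. The tuple layout `(start, end, type, subtype)`, the function `span_f1`, and the subtype labels `SUB_X`, `SUB_Y`, `SUB_Z` are all hypothetical placeholders; the abstract does not specify the evaluation protocol or data format used by the authors.

```python
def span_f1(gold, pred):
    """Exact-match F1 between two collections of mention tuples.

    Hypothetical helper for illustration only; not the authors' scorer.
    """
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def drop_subtype(mentions):
    """Project (start, end, type, subtype) tuples down to (start, end, type)."""
    return {(start, end, etype) for start, end, etype, _ in mentions}


# Two nested mentions: an ORG spanning tokens 0-3 that contains a GPE at 2-3.
# Subtype labels are placeholders, not real ACE or WojoodFine tags.
gold = [(0, 3, "ORG", "SUB_X"), (2, 3, "GPE", "SUB_Y")]
pred = [(0, 3, "ORG", "SUB_X"), (2, 3, "GPE", "SUB_Z")]

f1_with_subtypes = span_f1(gold, pred)
f1_coarse = span_f1(drop_subtype(gold), drop_subtype(pred))
```

Here the prediction gets both spans and main types right but one subtype wrong, so `f1_coarse` is 1.0 while `f1_with_subtypes` is 0.5. Nesting is represented simply by allowing overlapping spans in the same set.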

