CL
Jun 22, 2023

Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain

Huda AlShuhayeb, Behrouz Minaei-Bidgoli, Mohammad E. Shenassa, Sayyed-Ali Hossayni

arXiv:2307.09630v2

h-index: 38

Originality: Synthesis-oriented

AI Analysis

This dataset provides a standardized resource for researchers and practitioners working on Arabic natural language processing, specifically in religious contexts, though the contribution is incremental because it builds on existing datasets.

The authors address the lack of a comprehensive dataset for evaluating Arabic word segmentation tools in the Hadith domain by creating Noor-Ghateh, a benchmark dataset of approximately 223,690 words from the book 'Shariat al-Islam'. They report that it exceeds existing Hadith Arabic datasets in both volume and word variety.

Arabic has numerous complex and rich morphological features, which are highly useful when analyzing traditional Arabic texts, especially in literary and religious contexts, and which help in understanding the meaning of those texts. Word segmentation means separating a word into its components, such as the root and affixes. In morphological datasets, the variety of markers and the number of data samples help to evaluate morphological techniques. In this paper, we present a standard dataset for evaluating Arabic segmentation tools, which includes approximately 223,690 words from the "Shariat al-Islam" book, labeled by human experts. In terms of volume and word variety, this dataset is superior to the other Hadith Arabic datasets, to the best of our knowledge. To evaluate the dataset, we applied different methods, including Farasa, Camel, and ALP, reported the annotation quality, and analyzed the benchmark specifications. This be
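As a rough illustration of how a segmenter's output might be scored against expert-labeled segmentations, the sketch below computes boundary-level precision, recall, and F1. This is not the paper's evaluation protocol: the metric choice, the function names `boundaries` and `boundary_prf`, and the transliterated toy words are all assumptions made for illustration, and the code assumes the tool preserves each word's surface form.

```python
from typing import List, Set, Tuple


def boundaries(segments: List[str]) -> Set[int]:
    """Return the character offsets of the internal cuts in one segmented word."""
    cuts: Set[int] = set()
    pos = 0
    for seg in segments[:-1]:  # the end of the last segment is not an internal cut
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_prf(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Boundary-level precision, recall, and F1 over aligned word lists.

    Assumption: gold and pred cover the same words in the same order, and
    joining the segments of a word gives the same surface string in both.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if "".join(g) != "".join(p):
            raise ValueError(f"surface mismatch: {g!r} vs {p!r}")
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)   # cuts found in both
        fp += len(pb - gb)   # cuts only the tool proposed
        fn += len(gb - pb)   # cuts the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data in Latin transliteration (not taken from Noor-Ghateh).
gold = [["w", "al", "ktab"], ["ktb", "t"]]
pred = [["wal", "ktab"], ["ktb", "t"]]
precision, recall, f1 = boundary_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Scoring boundaries rather than whole words gives partial credit when a tool finds some but not all of the cuts in a word. A stricter alternative would count a word as correct only if its full segmentation matches the gold label.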
