CL | AI | Jun 22, 2025

QuranMorph: Morphologically Annotated Quranic Corpus

arXiv:2506.18148v1 | 5 citations | h-index: 11 | Has Code
Originality: Synthesis-oriented
AI Analysis

QuranMorph provides a foundational resource for computational linguistics and Quranic studies and enables inter-linking with other linguistic databases. The contribution is incremental, however, because it applies existing annotation methods to new data.

The researchers addressed the lack of a morphologically annotated corpus for the Quran by creating QuranMorph, a corpus of 77,429 tokens in which expert linguists manually lemmatized each token and tagged its part of speech using a fine-grained tagset of 40 tags.

We present the QuranMorph corpus, a morphologically annotated corpus for the Quran (77,429 tokens). Each token in QuranMorph was manually lemmatized and tagged with its part of speech by three expert linguists. The lemmatization process used lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be inter-linked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at https://sina.birzeit.edu/quran.
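The abstract describes each token as carrying a lemma and a part-of-speech tag, with the lemma serving as the link to other resources. A minimal sketch of that idea follows, under stated assumptions: the `AnnotatedToken` record, its field names, the helper functions, and the sample values are all hypothetical placeholders for illustration. They do not reflect the corpus's actual file format, real Qabas lemma identifiers, or actual SAMA/Qabas tag names.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedToken:
    """One annotated token (hypothetical layout, not the corpus's real schema)."""
    surface: str  # token as it appears in the text
    lemma: str    # placeholder standing in for a Qabas lemma identifier
    pos: str      # placeholder label, not necessarily a SAMA/Qabas tag


def tag_counts(tokens):
    """Count how many tokens carry each POS label."""
    return Counter(t.pos for t in tokens)


def index_by_lemma(tokens):
    """Map each lemma to the positions of its tokens.

    A shared lemma key is what would let tokens be joined against
    entries in another lemma-keyed resource.
    """
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lemma].append(position)
    return dict(index)


# Placeholder records, invented for illustration only.
sample = [
    AnnotatedToken("w1", "lemma_a", "NOUN"),
    AnnotatedToken("w2", "lemma_b", "VERB"),
    AnnotatedToken("w3", "lemma_a", "NOUN"),
]

if __name__ == "__main__":
    print(tag_counts(sample))
    print(index_by_lemma(sample))
```

Keying the index on the lemma rather than the surface form means that differently inflected tokens of the same word resolve to one entry, which is the property the abstract credits for inter-linking.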
