CL AI May 19, 2022

Curras + Baladi: Towards a Levantine Corpus

Karim El Haff, Mustafa Jarrar, Tymaa Hammouda, Fadi Zaraket

arXiv:2205.09692v131.4592 citations h-index: 29

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for dialect-specific resources in Arabic NLP, particularly for Levantine dialects, but it is incremental as it builds upon an existing corpus.

To address the challenge of processing Arabic dialects, the authors created Baladi, a Lebanese corpus of around 9.6K morphologically annotated tokens. It is intended to enrich and correct the existing Palestinian Curras corpus, with the aim of developing a more general Levantine corpus.

The processing of the Arabic language is a complex field of research. This is due to many factors, including the complex and rich morphology of Arabic, its high degree of ambiguity, and the presence of several regional varieties that need to be processed while taking into account their unique characteristics. When its dialects are taken into account, this language pushes the limits of NLP to find solutions to problems posed by its inherent nature. It is a diglossic language; the standard language is used in formal settings and in education and is quite different from the vernacular languages spoken in the different regions and influenced by older languages that were historically spoken in those regions. This should encourage NLP specialists to create dialect-specific corpora such as the Palestinian morphologically annotated Curras corpus of Birzeit University. In this work, we present the Lebanese Corpus Baladi that consists of around 9.6K morphologically annotated tokens. Since Lebanese and Palestinian dialects are part of the same Levantine dialectal continuum, and thus highly mutually intelligible, our proposed corpus was constructed to be used to (1) enrich Curras and transform it into a more general Levantine corpus and (2) improve Curras by solving detected errors.
