Fadi A Zaraket

CL
3papers
19citations
Novelty33%
AI Score36

3 Papers

CLDec 13, 2022Code
Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Copora with Morphological Annotations

Mustafa Jarrar, Fadi A Zaraket, Tymaa Hammouda et al.

This article presents morphologically-annotated Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects Lisan corpora. Lisan features around 1.2 million tokens. We collected the content of the corpora from several social media platforms. The Yemeni corpus (~ 1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (~ 50K tokens each) came manually from Facebook and YouTube posts and comments. Thirty five (35) annotators who are native speakers of the target dialects carried out the annotations. The annotators segemented all words in the four corpora into prefixes, stems and suffixes and labeled each with different morphological features such as part of speech, lemma, and a gloss in English. An Arabic Dialect Annotation Toolkit ADAT was developped for the purpose of the annation. The annotators were trained on a set of guidelines and on how to use ADAT. We developed ADAT to assist the annotators and to ensure compatibility with SAMA and Curras tagsets. The tool is open source, and the four corpora are also available online.

CLJan 19
Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

Abdellah El Mekki, Samar M. Magdy, Houdaifa Atou et al.

Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce \textbf{Alexandria}, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant persistent challenges.

SEMay 17, 2013
Semantic Guidance and Feedback for the Construction of Specifications and Implementations

Paul C Attie, Fadi A Zaraket, Mohammad Fawaz et al.

The problem of writing a specification which accurately reflects the intent of the developer has long been recognized as fundamental. We propose a method and a supporting tool to write and check a specification and an implementation using a set of use cases, \ie input-output pairs that the developer supplies. These are instances of both good (correct) and bad (incorrect) behavior. We assume that the use cases are accurate, as it is easier to generate use cases than to write an accurate specification. We incrementally construct a specification (precondition and postcondition) based on semantic feedback generated from these use cases. We check the accuracy of the constructed specification using two proposed algorithms. The first algorithm checks the accuracy of the specification against an automatically generated specification from a supplied finite domain of use cases. The second checks the accuracy of the specification via reducing its domain to a finite yet equally satisfiable domain if possible. When the specification is mature, we start to also construct a program that satisfies the specification. However, our method makes provision for the continued modification of the specification, if needed. We illustrate our method with two examples; linear search and text justify.