Diacritization of Maghrebi Arabic Sub-Dialects
This work addresses a low-resource problem relevant to applications such as Text-to-Speech in dialectal Arabic. It focuses on specific sub-dialects and is incremental in that it applies existing methods to new data.
The paper tackles the problem of automatic diacritization for Maghrebi Arabic sub-dialects, specifically Tunisian and Moroccan, using a character-level deep neural network with bi-LSTM and CRF layers, achieving word error rates of 2.7% for Moroccan and 3.6% for Tunisian.
Diacritization attempts to restore the short vowels in written Arabic text, which are typically omitted. This process is essential for applications such as Text-to-Speech (TTS). While diacritization of Modern Standard Arabic (MSA) still holds the lion's share of attention, research on dialectal Arabic (DA) diacritization is very limited. In this paper, we present our contribution and results on the automatic diacritization of two sub-dialects of Maghrebi Arabic, namely Tunisian and Moroccan, using a character-level deep neural network architecture that stacks two bi-LSTM layers followed by a CRF output layer. The model achieves word error rates of 2.7% and 3.6% for Moroccan and Tunisian, respectively, and is capable of implicitly identifying the sub-dialect of the input.
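To make the task and the reported metric concrete, the following is a minimal Python sketch, not the authors' evaluation code. It assumes the usual diacritization convention that a word counts as an error if any of its diacritics differs from the reference; the abstract does not spell out the exact protocol. The function names `strip_diacritics` and `word_error_rate` and the two-word example are hypothetical, introduced only for illustration.

```python
# Arabic short-vowel marks and related signs (fathatan .. sukun), U+064B..U+0652.
DIACRITICS_START, DIACRITICS_END = "\u064B", "\u0652"
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
KAF, TA, BA = "\u0643", "\u062A", "\u0628"  # base letters k, t, b


def strip_diacritics(text: str) -> str:
    """Remove diacritics, yielding the undiacritized form the model takes as input."""
    return "".join(ch for ch in text if not (DIACRITICS_START <= ch <= DIACRITICS_END))


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Fraction of words whose diacritized form differs from the reference.

    Assumption: both strings contain the same whitespace-separated words,
    since diacritization changes only the marks and not the tokenization.
    """
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same number of words")
    if not ref_words:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_words, hyp_words))
    return errors / len(ref_words)


# Illustrative words built from the same three letters with different vowels.
kataba = KAF + FATHA + TA + FATHA + BA + FATHA   # "kataba"
kutiba = KAF + DAMMA + TA + KASRA + BA + FATHA   # "kutiba"

reference = kataba + " " + kutiba
hypothesis = kataba + " " + kataba  # second word diacritized incorrectly

print(strip_diacritics(reference))             # both words reduce to the same bare letters
print(word_error_rate(reference, hypothesis))  # one of two words is wrong
```

The example also shows why the task is ambiguous: both words collapse to the same undiacritized string, so a model has to rely on context to choose the right vowels.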