CL | Nov 13, 2025

ADI-20: Arabic Dialect Identification dataset and models

Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares

arXiv:2511.10070v1 | 4 citations | h-index: 19 | Has Code | INTERSPEECH

Originality: Synthesis-oriented

AI Analysis

This work contributes a dataset and trained models for Arabic Dialect Identification. It is an incremental contribution: it extends the existing ADI-17 dataset rather than introducing a new task.

The authors extend an existing dataset to cover the dialects of all Arabic-speaking countries, for a total of 3,556 hours from 19 dialects plus Modern Standard Arabic. They find that training on only 30% of the data leads to a small decrease in F1 score.

We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
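The abstract mentions encoder blocks followed by an attention pooling layer and a dense classification layer. The sketch below illustrates that general pattern only; it is not the authors' implementation. It assumes a single learnable query vector for the attention scores, uses plain lists of floats in place of real Whisper encoder outputs, and uses random numbers in place of trained weights. The names `attention_pool` and `classify` are hypothetical.

```python
import math
import random

NUM_CLASSES = 20  # 19 dialects + MSA, as stated in the abstract


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(frames, query):
    """Collapse T frame vectors into one utterance-level vector.

    Assumption: each frame is scored by a dot product with one
    learnable query vector; the scores are softmax-normalised over
    time and used as weights for a weighted average of the frames.
    """
    weights = softmax([dot(f, query) for f in frames])
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def classify(pooled, weight_rows, bias):
    """Dense layer followed by a softmax over the dialect classes."""
    logits = [dot(row, pooled) + b for row, b in zip(weight_rows, bias)]
    return softmax(logits)


# Toy demonstration with placeholder values (not real encoder features).
rng = random.Random(0)
dim, num_frames = 8, 5
demo_frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
demo_query = [rng.gauss(0, 1) for _ in range(dim)]
demo_w = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(NUM_CLASSES)]
demo_b = [0.0] * NUM_CLASSES

demo_pooled, demo_weights = attention_pool(demo_frames, demo_query)
demo_probs = classify(demo_pooled, demo_w, demo_b)
print("predicted class index:", demo_probs.index(max(demo_probs)))
```

With an all-zero query every frame receives the same weight, so the pooling reduces to a plain mean over time. A trained query lets the model weight frames unequally.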
