CL, Feb 19, 2021

Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT

arXiv:2102.09749v232.7801 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses dialect identification in Arabic social media data. The contribution is incremental: it applies existing methods to a new dataset from a shared task.

The paper tackles the problem of identifying the geographical origin of Arabic tweets, written in either Modern Standard Arabic (MSA) or dialectal Arabic (DA), using Farasa segmentation and the Transformer models AraBERT and AraELECTRA. It reports macro F1-scores of 0.216, 0.235, 0.054, and 0.043 across the four subtasks, ranking second in the MSA identification subtasks and fourth in the DA identification subtasks.

This paper presents our approach to the EACL WANLP-2021 Shared Task 1: Nuanced Arabic Dialect Identification (NADI). The task is to develop a system that identifies the geographical location (country/province) from which an Arabic tweet, written in Modern Standard Arabic or a dialect, originates. We solve the task in two parts. The first part involves pre-processing the provided dataset by cleaning, adding and segmenting various parts of the text. The second involves experiments with different versions of two Transformer-based models, AraBERT and AraELECTRA. Our final approach achieved macro F1-scores of 0.216, 0.235, 0.054, and 0.043 in the four subtasks, and we ranked second in the MSA identification subtasks and fourth in the DA identification subtasks.
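The scores above are macro-averaged F1, which gives every class (country or province) equal weight regardless of how many tweets it has. The following is a minimal sketch of that metric in plain Python, not the shared task's official scorer; the function name `macro_f1` and the example labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical country labels, for illustration only
gold = ["Egypt", "Egypt", "Iraq", "Morocco"]
pred = ["Egypt", "Iraq", "Iraq", "Egypt"]
print(round(macro_f1(gold, pred), 4))
```

Because each class contributes equally, a class that is never predicted correctly pulls the average down as much as a frequent one would, which is one reason macro F1 tends to be low when there are many fine-grained classes.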
