CL, Dec 16, 2023

USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification

arXiv:2312.10536v1134 citations, h-index: 8, ARABICNLP
Originality: Synthesis-oriented
AI Analysis

This work addresses Arabic dialect identification, but it is incremental: it applies existing methods to a specific dataset.

The paper tackles country-level Arabic dialect identification by exploring preprocessing and feature engineering strategies. Its system achieved an F1 score of 62.51%, about 10.4 percentage points below the 72.91% average of the other submitted systems.

In this paper, we conduct an in-depth analysis of several key factors influencing the performance of Arabic dialect identification in the NADI'2023 shared task, with a specific focus on the first subtask, country-level dialect identification. Our investigation covers the effects of surface preprocessing, morphological preprocessing, a FastText vector model, and the weighted concatenation of TF-IDF features. For classification, we employ the Linear Support Vector Classification (LSVC) model. During the evaluation phase, our system achieves an F1 score of 62.51%, compared with an average F1 score of 72.91% across the systems submitted for the first subtask.
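The abstract names surface preprocessing as one of the factors studied but does not list the individual steps. The following is a minimal sketch of what such a step might look like for Arabic social media text. Every rule in it (stripping URLs and mentions, removing diacritics and tatweel, unifying alef variants, collapsing elongated characters) is an assumption based on common practice, not the authors' actual pipeline, and the function name `surface_normalize` is hypothetical.

```python
import re

# All rules below are illustrative assumptions; the abstract does not
# specify which surface preprocessing steps the authors applied.

# Arabic diacritics: tanwin, short vowels, shadda, sukun, superscript alef.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), the elongation character used for visual stretching.
_TATWEEL = "\u0640"
# URLs and @mentions, which carry little dialectal signal.
_URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
# Any character repeated three or more times in a row.
_REPEATS = re.compile(r"(.)\1{2,}")
# Map alef with hamza above/below and alef with madda to bare alef.
_ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
})


def surface_normalize(text: str) -> str:
    """Apply a small set of assumed surface-level normalization rules."""
    text = _URL_OR_MENTION.sub(" ", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = text.translate(_ALEF_VARIANTS)
    # Collapse runs of 3+ identical characters down to two.
    text = _REPEATS.sub(r"\1\1", text)
    # Normalize whitespace.
    return " ".join(text.split())


if __name__ == "__main__":
    print(surface_normalize("\u0623\u0647\u0644\u0627\u064B @user http://example.com"))
```

Whether each rule helps or hurts is an empirical question of the kind the paper investigates: normalization that removes noise can also erase spelling cues that distinguish dialects, so each rule is best toggled and evaluated separately.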

