CL · Jul 18, 2024

AlcLaM: Arabic Dialectal Language Model

Murtadha Ahmed, Saghir Alfasly, Bo Wen, Jamaal Qasem, Mohammed Ahmed, Yunfeng Liu

arXiv:2407.13097v1 · 1.93 citations · h-index: 11 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the lack of effective language models for dialectal Arabic in NLP applications, though the advance is incremental because it builds on existing BERT-based methods.

The authors tackle the poor performance of existing Arabic language models on regional dialects by constructing a 3.4M-sentence dialectal corpus from social media and training a BERT-based model, AlcLaM, on it. AlcLaM achieves superior performance on various Arabic NLP tasks despite using only 13 GB of text, which is between 7.8% and 21.3% of the data used by comparable models.

Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often suffer from high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on Modern Standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We use this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Our model, named AlcLaM, was trained on only 13 GB of text, which is just 7.8%, 10.2%, and 21.3% of the data used by the existing models CAMeL, MARBERT, and ArBERT, respectively. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available on GitHub at https://github.com/amurtadha/Alclam and on HuggingFace at https://huggingface.co/rahbi.
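The data-size comparison in the abstract can be turned into rough absolute figures. The short sketch below divides the 13 GB figure by each stated percentage to estimate how much text each comparison model would have used. These implied sizes are derived only from the abstract's numbers and are not values reported on this page; the names `ALCLAM_GB`, `SHARE`, and `implied_size_gb` are illustrative.

```python
# AlcLaM training data size, as stated in the abstract.
ALCLAM_GB = 13.0

# Fraction of each model's training data that 13 GB represents,
# in the order the abstract lists them.
SHARE = {"CAMeL": 0.078, "MARBERT": 0.102, "ArBERT": 0.213}


def implied_size_gb(alclam_gb: float, share: float) -> float:
    """Back out a comparison model's data size from AlcLaM's share of it."""
    return alclam_gb / share


# Implied sizes, rounded to one decimal place (estimates, not reported values).
implied = {name: round(implied_size_gb(ALCLAM_GB, s), 1) for name, s in SHARE.items()}

for name, size in implied.items():
    print(f"{name}: ~{size} GB")
```

Under these assumptions, the comparison corpora come out at roughly 167 GB, 127 GB, and 61 GB, which makes concrete how much smaller the AlcLaM training set is.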

