CL | AI | Oct 26, 2024

A Survey of Large Language Models for Arabic Language and its Dialects

arXiv:2410.20238v2 | 20 citations | h-index: 13
Originality: Synthesis-oriented
AI Analysis

The paper addresses the underrepresentation of Arabic, and especially its dialects, in language processing resources. Its contribution is incremental, since it synthesizes existing research rather than introducing new methods.

This survey provides a comprehensive overview of Large Language Models (LLMs) for Arabic and its dialects, covering architectures, datasets, and performance on tasks like sentiment analysis, while highlighting challenges such as limited dialectal data and the need for openness.

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, which span Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and their performance on downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as the availability of source code, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and by stressing the need for more inclusive and representative models.

