CL LG, Aug 19, 2025

Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation

arXiv:2508.13525v1, 3 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the limited support in Arabic LLMs for Saudi dialects such as Najdi and Hijazi. The contribution is incremental: it applies an existing method (LoRA) to a new dataset.

The paper tackles the underrepresentation of Saudi dialects in Arabic large language models by LoRA-tuning ALLaM-7B-Instruct-preview on a private Saudi Dialect Instruction dataset. The Dialect-Token model gives the best dialect control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%.

Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
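
The two training variants and the classifier-based metrics described in the abstract can be illustrated with a small sketch. This is not the authors' released code. The tag strings (`<HIJAZI>`, `<NAJDI>`), the classifier label names (`"saudi"`, `"msa"`), and the function names are hypothetical, and the sketch assumes that the Saudi rate and MSA leakage are the percentages of generated outputs that the external classifier labels as Saudi and as MSA respectively.

```python
from collections import Counter

# Hypothetical tag strings: the abstract says a dialect tag is prepended,
# but does not give its exact form.
DIALECT_TAGS = {"hijazi": "<HIJAZI>", "najdi": "<NAJDI>"}


def format_instruction(instruction: str, dialect: str, use_token: bool) -> str:
    """Format one instruction for either training variant.

    Dialect-Token (use_token=True): prepend an explicit dialect tag.
    No-Token (use_token=False): leave the tag out at formatting time.
    """
    instruction = instruction.strip()
    if not use_token:
        return instruction
    return f"{DIALECT_TAGS[dialect.lower()]} {instruction}"


def dialect_rates(labels: list[str],
                  saudi_label: str = "saudi",
                  msa_label: str = "msa") -> dict[str, float]:
    """Percentage of outputs labelled Saudi and labelled MSA.

    `labels` holds one predicted label per generated response, as produced
    by some external dialect classifier (label names are assumptions).
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {
        "saudi_rate": 100.0 * counts[saudi_label] / total,
        "msa_leakage": 100.0 * counts[msa_label] / total,
    }
```

For example, `format_instruction("Tell me about the weather", "Najdi", use_token=True)` yields the instruction with the hypothetical `<NAJDI>` tag in front, while `use_token=False` returns it unchanged. Under the stated assumption, the reported change from 47.97% to 84.21% corresponds to `saudi_rate` computed over the base model's and the Dialect-Token model's test-set outputs.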
