AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic
This work addresses the social inequalities and limited LLM applications that affect dialectal Arabic users, though its contribution is incremental because it focuses on evaluation rather than new model development.
The paper tackles the problem of evaluating large language models (LLMs) in dialectal Arabic, which is under-served and lacks operationalized performance measurements. It develops a framework to assess nine LLMs across eight dialectal varieties and finds that LLMs understand dialectal Arabic better than they generate it, a gap the authors attribute to the models' reluctance to generate dialectal Arabic rather than to poor fluency.
Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits LLM applications, yet the research community lacks operationalized performance measurements in DA. We present a framework that comprehensively assesses LLMs' DA modeling capabilities across four dimensions: fidelity, understanding, quality, and diglossia. We evaluate nine LLMs in eight DA varieties and provide practical recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, not because their DA fluency is poor, but because they are reluctant to generate DA. Further analysis suggests that current post-training can contribute to bias against DA, that few-shot examples can overcome this deficiency, and that otherwise no measurable features of input text correlate well with LLM DA performance.