CL, AI · Jan 12, 2025

A Comprehensive Evaluation of Large Language Models on Mental Illnesses in Arabic Context

Noureldin Zahran, Aya E. Fouda, Radwa J. Hanafy, Mohammed E. Fouda

arXiv:2501.06859v1 · 8 citations · h-index: 25

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for accessible mental health tools in the Arab world by providing insights into optimizing LLMs for culturally sensitive applications, though it is incremental as it focuses on evaluation and optimization of existing methods.

The study evaluated 8 large language models on mental health datasets in Arabic contexts, finding that prompt engineering improved scores by 14.5% on average, few-shot prompting boosted accuracy by a factor of 1.58 for GPT-4o Mini, and model selection was crucial for tasks like binary classification and severity prediction.

Mental health disorders pose a growing public health concern in the Arab world, emphasizing the need for accessible diagnostic and intervention tools. Large language models (LLMs) offer a promising approach, but their application in Arabic contexts faces challenges including limited labeled datasets, linguistic complexity, and translation biases. This study comprehensively evaluates 8 LLMs, including general multilingual models as well as bilingual ones, on diverse mental health datasets (such as AraDepSu, Dreaddit, MedMCQA), investigating the impact of prompt design, language configuration (native Arabic vs. translated English, and vice versa), and few-shot prompting on diagnostic performance. We find that prompt engineering significantly influences LLM scores, mainly due to reduced instruction following: our structured prompt outperformed a less structured variant on multi-class datasets by an average of 14.5%. While language influence on performance was modest, model selection proved crucial: Phi-3.5 MoE excelled in balanced accuracy, particularly for binary classification, while Mistral NeMo showed superior performance in mean absolute error for severity prediction tasks. Few-shot prompting consistently improved performance, with particularly substantial gains observed for GPT-4o Mini on multi-class classification, boosting accuracy by an average factor of 1.58. These findings underscore the importance of prompt optimization, multilingual analysis, and few-shot learning for developing culturally sensitive and effective LLM-based mental health tools for Arabic-speaking populations.
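
The abstract reports results in terms of balanced accuracy (for classification) and mean absolute error (for severity prediction). As a rough illustration of what these two metrics measure, the following is a minimal sketch using their standard definitions; the function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally
    regardless of how many examples it has."""
    total = defaultdict(int)    # examples seen per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted severity scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels (illustrative only, not data from the study).
labels = [0, 0, 0, 0, 1, 1]       # imbalanced binary task
preds = [0, 0, 0, 1, 1, 0]
print(balanced_accuracy(labels, preds))    # per-class recalls 0.75 and 0.5

severity_true = [0, 1, 2, 3]      # ordinal severity levels
severity_pred = [0, 2, 2, 1]
print(mean_absolute_error(severity_true, severity_pred))
```

Balanced accuracy is a common choice when class frequencies are skewed, since plain accuracy can look high for a model that mostly predicts the majority class; mean absolute error suits ordinal severity labels because it penalizes a prediction in proportion to how far it lands from the true level.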
