CL · Sep 12, 2025

Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records

Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban

arXiv:2509.10108v1 · 4.91 citations · h-index: 2 · AICCSA

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of Arabic medical NLP resources available to healthcare chatbot developers. The advance is incremental, since it builds on an existing dataset and established methods.

The researchers tackled the scarcity of Arabic medical chatbot training data by generating 80,000 synthetic patient records, expanding the corpus from 20,000 to 100,000 records. Five LLMs were fine-tuned on the augmented corpus, and in an ablation study ChatGPT-4o-generated data yielded higher F1-scores and fewer hallucinations than Gemini-generated data across all models.

The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using the advanced generative AI systems ChatGPT-4o and Gemini 2.5 Pro, we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of the synthetic sources, we conducted an ablation study comparing ChatGPT-4o-generated and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.
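
The F1-scores mentioned in the abstract come from BERTScore, which greedily matches token embeddings between a generated answer and a reference answer by cosine similarity. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes token embeddings are already available as plain lists of floats; the function names `cosine` and `bertscore_f1` and the toy two-dimensional vectors are hypothetical, and a real evaluation would take embeddings from a pretrained contextual model and may apply IDF weighting or baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its
    most similar candidate token. F1 is their harmonic mean.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (hypothetical): the candidate covers only one of the
# two reference tokens, so precision is high and recall is lower.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
p, r, f1 = bertscore_f1(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case the candidate answer is fully supported by the reference but omits half of it, which is the kind of gap the recall term is meant to expose.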
