CL · AI · May 21, 2025

Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

arXiv:2505.16000v53 citations · h-index: 5 · ALIFE
Originality: Incremental advance
AI Analysis

The work offers a new option for Persian medical AI applications in resource-constrained environments, though the advance is incremental because it builds on existing fine-tuning methods.

This study addresses the difficulty small language models have with specialized medical content in Persian, a low-resource language. The authors curated a dataset from online sources and fine-tuned a baseline model on it. The fine-tuned model answered medical questions more accurately, passed the Iranian Basic Medical Science Entrance Exam, and improved accuracy on Persian-translated MMLU by an average of 2.67%.

The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study introduces a newly curated dataset comprising 20k doctor-patient Q&A pairs and 60% of a 90-million-token crawled corpus from medical magazines. Using a parameter-efficient fine-tuning approach, we enhanced the medical knowledge of the baseline model, aya-expanse-8b. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and passed the Iranian Basic Medical Science Entrance Exam (IBSEE) of September 2023, which the baseline model did not. Additionally, the fine-tuned model improved Persian-translated MMLU accuracy by an average of 2.67%. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments. Future research could explore multimodal input to further enhance performance.
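
The abstract names a parameter-efficient fine-tuning approach without specifying the method. As a rough illustration of why such methods suit resource-constrained settings, the sketch below counts the trainable parameters of a low-rank adapter (LoRA-style) attached to a single weight matrix. The choice of LoRA, the 4096 x 4096 layer shape, and the rank of 16 are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch only: LoRA, the layer shape, and the rank are
# assumptions, not the configuration reported by the authors.

def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning a dense weight matrix directly."""
    return d_in * d_out


def low_rank_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a low-rank update B @ A, with A: rank x d_in and B: d_out x rank."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the dense matrix's parameter count that the adapter trains."""
    return low_rank_adapter_params(d_in, d_out, rank) / full_params(d_in, d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 16  # hypothetical layer shape and rank
    print("dense parameters:  ", full_params(d_in, d_out))
    print("adapter parameters:", low_rank_adapter_params(d_in, d_out, rank))
    print("trainable fraction:", trainable_fraction(d_in, d_out, rank))
```

Under these assumed values the adapter trains well under 1% of the parameters in the dense matrix, which is the general reason parameter-efficient methods fit small hardware budgets.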

Code Implementations: 1 repo