CL Jun 29, 2021

New Arabic Medical Dataset for Diseases Classification

Jaafar Hammoud, Aleksandra Vatian, Natalia Dobrenko, Nikolai Vedernikov, Anatoly Shalyto, Natalia Gusarova

arXiv:2106.15236v30.27 citations

Originality: Synthesis-oriented

AI Analysis

This work provides a domain-specific resource for Arabic medical NLP, but the contribution is incremental: it applies existing methods to new data.

The authors tackled the shortage of Arabic medical datasets by introducing a new dataset of 2,000 medical documents for text classification into 10 disease classes, and they fine-tuned pre-trained models like BERT, Arabert, and AraBioNER on it.

Arabic suffers from a severe shortage of datasets suitable for training deep learning models, and the existing ones cover only general, non-specialized categories. In this work, we introduce a new Arabic medical dataset of two thousand medical documents collected from several Arabic medical websites and from the Arab Medical Encyclopedia. The dataset was built for text classification and covers 10 disease classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological). Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google; Arabert, which builds on BERT with a large Arabic corpus; and AraBioNER, which builds on Arabert with an Arabic medical corpus.
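As a rough illustration of the classification setup described above, the following sketch maps the ten class names from the abstract to integer labels and performs a per-class (stratified) train/test split, the kind of preparation a fine-tuning run would typically need. The helper name `stratified_split`, the 20% test fraction, and the fixed seed are assumptions made for illustration; the abstract does not state how the authors split their data.

```python
import random
from collections import defaultdict

# The ten class names listed in the abstract.
CLASSES = [
    "Blood", "Bone", "Cardiovascular", "Ear", "Endocrine",
    "Eye", "Gastrointestinal", "Immune", "Liver", "Nephrological",
]
LABEL2ID = {name: i for i, name in enumerate(CLASSES)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, class_name) pairs into train and test lists.

    Each class contributes the same share of its documents to the
    test set. test_fraction and seed are illustrative assumptions,
    not values taken from the paper.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, LABEL2ID[label]))

    rng = random.Random(seed)
    train, test = [], []
    for label in CLASSES:
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder documents stand in for the real corpus (hypothetical data).
examples = [(f"doc {i} about {c}", c) for c in CLASSES for i in range(10)]
train, test = stratified_split(examples)
```

The resulting `(text, label_id)` pairs could then be tokenized and passed to whichever pre-trained model is being fine-tuned, with `ID2LABEL` used to turn predictions back into class names.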
