CL · Apr 3

BioUNER: A Benchmark Dataset for Clinical Urdu Named Entity Recognition

Wazir Ali, Adeeb Noor, Sanaullah Mahar, Alia, Muhammad Mazhar Younas

arXiv:2604.02904 · 10.2 · h-index: 5

Predicted impact: top 66% in CL · last 90 days
Originality: Synthesis-oriented

AI Analysis

BioUNER provides a reliable benchmark for Urdu language processing in the biomedical domain, addressing a gap for researchers and practitioners, though the contribution is incremental because it applies existing methods to new data.

The authors tackled the lack of a benchmark dataset for biomedical Urdu named entity recognition by creating BioUNER, a gold-standard dataset with 153K tokens and an inter-annotator agreement score of 0.78, and demonstrated its utility by evaluating models like SVM, LSTM, mBERT, and XLM-RoBERTa.

In this article, we present a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER), developed by crawling health-related articles from online Urdu news portals, medical prescriptions, and hospital health blogs and websites. After preprocessing, three native annotators with familiarity in the medical domain participated in the annotation process using the Doccano text annotation tool and annotated 153K tokens. Following annotation, the proposed BioUNER dataset was evaluated both intrinsically and extrinsically. An inter-annotator agreement score of 0.78 was achieved, thereby validating the dataset as gold-standard quality. To demonstrate the utility and benchmarking capability of the dataset, we evaluated several machine learning and deep learning models, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Multilingual BERT (mBERT), and XLM-RoBERTa. The gold-standard BioUNER dataset serves as a reliable benchmark and a valuable addition to Urdu language processing resources.
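The abstract reports an inter-annotator agreement score of 0.78 across three annotators but does not name the statistic. As an illustration only, the sketch below assumes Fleiss' kappa, a common choice when more than two annotators label the same tokens. The function name `fleiss_kappa` and the tag names (`O`, `B-DISEASE`) are hypothetical and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(labels_per_token):
    """Fleiss' kappa for a list of tokens, each labeled once by every annotator.

    labels_per_token: list of lists; each inner list holds one label per
    annotator, and every inner list has the same length.
    Assumption: this statistic is illustrative; the paper does not state
    which agreement measure produced its 0.78 score.
    """
    n_items = len(labels_per_token)
    n_raters = len(labels_per_token[0])
    categories = sorted({label for row in labels_per_token for label in row})

    # counts[i][c] = number of annotators who assigned label c to token i
    counts = [Counter(row) for row in labels_per_token]

    # Observed agreement per token, then averaged over all tokens
    p_i = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement from the overall proportion of each label
    p_j = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_j)

    if p_e == 1.0:
        # Only one label was ever used; agreement is trivially perfect
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Toy example: four tokens, three annotators, hypothetical tag names
toy = [
    ["O", "O", "O"],
    ["B-DISEASE", "B-DISEASE", "O"],
    ["B-DISEASE", "B-DISEASE", "B-DISEASE"],
    ["O", "O", "B-DISEASE"],
]
kappa = fleiss_kappa(toy)
```

On real data, each inner list would hold the three annotators' tags for one token of the corpus, exported from the annotation tool in a shared tokenization.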
