CL, AI, LG | Aug 22, 2025

MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian

Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu

arXiv:2508.16390v2 | 2 citations | h-index: 8 | Has Code | npj Digital Medicine

Originality: Synthesis-oriented

AI Analysis

This work provides a new benchmark for medical QA in Romanian, filling a gap for researchers and practitioners working on low-resource language NLP. Its contribution is incremental, since it applies existing methods to a new dataset.

The authors tackled the lack of medical question-answering datasets in Romanian by introducing MedQARo, a large-scale benchmark with 102,646 QA pairs, and found that fine-tuned large language models significantly outperform zero-shot models, indicating the need for domain- and language-specific tuning.

Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.
