CL | Sep 8, 2025

MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

Ruggero Marino Lazzaroni, Alessandro Angioi, Michelangelo Puliga, Davide Sanna, Roberto Marras

arXiv:2509.07135v16.71 citations | h-index: 17 | Has Code | CLiC-it

Originality: Synthesis-oriented

AI Analysis

MedBench-IT gives the Italian NLP community and EdTech developers a standardized evaluation resource for a critical domain. The contribution is incremental, however: it extends existing benchmarking efforts to a new language and domain.

The authors address the scarcity of benchmarks for non-English specialized domains by introducing MedBench-IT, the first comprehensive benchmark for evaluating large language models on Italian medical entrance examinations. Reproducibility tests showed 88.86% response consistency, and the analysis found a small inverse correlation between question readability and model performance.

Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading publisher of preparatory materials, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models, including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters), with a focus on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.
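
The abstract reports response consistency without defining it here. As a minimal sketch, assuming consistency means the fraction of questions for which a model returns the same answer in every repeated run, it could be computed as follows. The function name `response_consistency` and the toy answer lists are hypothetical illustrations, not the authors' code or data.

```python
def response_consistency(runs):
    """Fraction of questions answered identically in every run.

    runs: list of runs; each run is a list of answer labels,
    aligned so that position i refers to the same question in all runs.
    This definition is an assumption, not taken from the paper.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one question")
    n_questions = len(runs[0])
    if any(len(run) != n_questions for run in runs):
        raise ValueError("all runs must cover the same questions")
    # zip(*runs) groups the answers given to each question across runs;
    # a question is stable when that group contains a single distinct label.
    stable = sum(1 for answers in zip(*runs) if len(set(answers)) == 1)
    return stable / n_questions


# Toy data: three runs over four questions; only the third question varies.
runs = [
    ["A", "C", "B", "D"],
    ["A", "C", "E", "D"],
    ["A", "C", "B", "D"],
]
print(response_consistency(runs))  # 0.75
```

Grouping the same computation by subject would give a per-subject breakdown, which is one way to read "varying by subject".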
