CL · Sep 8, 2025

MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

arXiv:2509.07135v1 · 1 citation · h-index: 17 · Has Code · CLiC-it
Originality: Synthesis-oriented
AI Analysis

MedBench-IT provides a crucial resource for the Italian NLP community and EdTech developers by offering standardized evaluation for a critical domain. The contribution is incremental, however: it extends existing benchmarking efforts to a new language and domain.

The authors address the scarcity of benchmarks for non-English specialized domains by introducing MedBench-IT, the first comprehensive benchmark for evaluating large language models on Italian medical entrance examinations. Reproducibility tests show 88.86% response consistency, and the analysis identifies a small but statistically significant inverse correlation between question readability and model performance.

Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models, including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters), focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.
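
The abstract reports response consistency and an ordering bias analysis without defining either procedure here. As a rough illustration only, the sketch below assumes that consistency means the share of questions that receive the same answer in every repeated run, and that ordering bias is probed by shuffling the answer options while tracking the correct one. The function names `response_consistency` and `shuffle_options` are hypothetical, and the paper's actual definitions may differ.

```python
import random


def response_consistency(runs):
    """Fraction of questions answered identically in every run.

    ASSUMPTION: this is one plausible reading of "response consistency",
    not necessarily the metric used in the paper.
    `runs` is a list of runs; each run is a list of answer labels,
    one per question, in the same question order.
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one question")
    n_questions = len(runs[0])
    if any(len(run) != n_questions for run in runs):
        raise ValueError("all runs must cover the same questions")
    # zip(*runs) groups the answers given to each question across runs.
    stable = sum(1 for answers in zip(*runs) if len(set(answers)) == 1)
    return stable / n_questions


def shuffle_options(options, correct_index, seed):
    """Shuffle answer options and return them with the new correct index.

    ASSUMPTION: a simple way to probe ordering bias is to re-ask each
    question with its options in a different, reproducible order.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)


if __name__ == "__main__":
    # Three hypothetical runs over four questions; the third question flips.
    runs = [
        ["A", "B", "C", "D"],
        ["A", "B", "D", "D"],
        ["A", "B", "C", "D"],
    ]
    print(response_consistency(runs))
    print(shuffle_options(["mitosi", "meiosi", "osmosi", "lisi"], 1, seed=0))
```

Comparing accuracy on the original and shuffled orderings would then indicate whether a model favours particular answer positions.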
