Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering
This task addresses the need for robust evaluation of LLMs on medical QA at the general-practitioner level, though its contribution is incremental because it builds on existing shared-task frameworks.
The paper presents the ClinIQLink 2025 shared task, which evaluates large language models on medical question-answering using 4,978 expert-verified questions across seven formats, with automated scoring and physician audits.
In this paper, we present an overview of ClinIQLink, a shared task co-located with the 24th BioNLP workshop at ACL 2025 and designed to stress-test large language models (LLMs) on medically oriented question answering at the level of a general practitioner. The challenge supplies 4,978 expert-verified question-answer pairs, each grounded in a medical source, that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland's Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.
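As a rough illustration of the exact-match scoring applied to closed-ended items, the Python sketch below compares predictions with gold answers after a light normalization. The function names and the normalization step (lowercasing, trimming, collapsing whitespace) are assumptions made for this example and are not taken from the task's actual harness. The three-tier embedding metric for open-ended items is not sketched, because its details are not given here.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace (an assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def exact_match_accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Mean exact-match score over paired predictions and gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)


if __name__ == "__main__":
    # Hypothetical closed-ended items: one true/false and two multiple-choice.
    preds = [" True ", "B", "a"]
    golds = ["true", "C", "A"]
    print(exact_match_accuracy(preds, golds))
```

Under this sketch a response earns credit only when it matches the gold answer exactly after normalization, which suits true/false and multiple-choice items but would be too strict for free-text answers.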