CL · Mar 4

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre

arXiv:2603.04033v1 · 0.6 · h-index: 22

Originality: Incremental advance

AI Analysis

Automatic evaluation of French medical open-ended question answering (OEQA) typically requires expert annotations. This work explores LLM-as-a-judge as an alternative for researchers and practitioners in low-resource medical settings.

This paper evaluates Large Language Models (LLMs) as judges of semantic equivalence in French medical open-ended question answering (OEQA) and finds that their judgments depend heavily on which model generated the answer. Domain-adapted and large general-purpose models align most closely with expert annotations. Lightweight adaptation of a compact model substantially improves performance and reduces generator sensitivity, even with limited data.

Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.
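To make "generator-aware evaluation" concrete, the following is a minimal sketch of how judge-versus-expert agreement could be broken down by the model that generated each answer. The abstract does not state which agreement metric the authors use; Cohen's kappa is an assumption here, and the function names, generator names, and toy labels are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def cohen_kappa(expert, judge):
    """Cohen's kappa for two equal-length sequences of binary labels (0 or 1)."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(expert)
    # Fraction of items on which the two raters agree.
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    # Agreement expected by chance, from each rater's rate of label 1.
    p_expert = sum(expert) / n
    p_judge = sum(judge) / n
    expected = p_expert * p_judge + (1 - p_expert) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_by_generator(records):
    """Group (generator, expert_label, judge_label) triples by generator
    and compute kappa within each group."""
    grouped = defaultdict(lambda: ([], []))
    for generator, expert_label, judge_label in records:
        grouped[generator][0].append(expert_label)
        grouped[generator][1].append(judge_label)
    return {gen: cohen_kappa(e, j) for gen, (e, j) in grouped.items()}


# Toy data, invented for illustration: 1 = "semantically equivalent to the
# reference answer", 0 = "not equivalent".
toy_records = [
    ("generator_a", 1, 1), ("generator_a", 0, 0),
    ("generator_a", 1, 1), ("generator_a", 0, 0),
    ("generator_b", 1, 1), ("generator_b", 0, 1),
    ("generator_b", 1, 0), ("generator_b", 0, 0),
]

scores = agreement_by_generator(toy_records)
for generator, kappa in sorted(scores.items()):
    print(f"{generator}: kappa = {kappa:.2f}")
```

Reporting one score per generator, rather than a single pooled score, is what would expose the kind of variation across generators that the abstract describes.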
