CL · AI · Oct 12, 2025

LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints

AI2 · UW
arXiv:2510.10415v1 · 1 citation · h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses the challenge of reliable evaluation in resource-constrained clinical QA settings, offering incremental improvements in efficiency and consistency.

The paper addresses the evaluation of long-form clinical QA systems by introducing LongQAEval, a framework that compares coarse (answer-level) and fine-grained (sentence-level) evaluation. It finds that inter-annotator agreement varies by dimension and that annotating only a subset of sentences can reduce cost while maintaining reliability.

Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
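
To make the notion of inter-annotator agreement concrete, the following is a minimal sketch of one common IAA statistic, Cohen's kappa, for two annotators. The abstract does not state which agreement metric the paper uses, so the choice of kappa, the function name `cohen_kappa`, and the toy labels below are all assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative only: the paper's actual agreement metric is not
    specified in the abstract.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentence-level correctness labels (1 = correct, 0 = incorrect).
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Under these assumptions, the same function could be applied once to answer-level labels and once to sentence-level labels for each dimension (correctness, relevance, safety) to compare agreement across granularities.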

