CL | AI | Jan 7

Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

arXiv:2601.03627v2 | h-index: 4 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better evaluation of LLMs in real-world clinical applications. Its contribution is incremental, however, since it centers on benchmarking and dataset creation.

The paper addresses how to evaluate large language models (LLMs) on pre-consultation tasks in clinical settings by introducing the EPAG benchmark. It finds that fine-tuned small open-source models can outperform frontier LLMs and that more patient history does not always improve diagnostic performance.

We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly, by comparing the History of Present Illness (HPI) against diagnostic guidelines, and indirectly, through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline at https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
