CL AI May 28

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

arXiv:2605.30295

AI Analysis

For researchers evaluating LLMs in clinical decision support, this work provides a pipeline and dataset that better reflects real-world EHR data formats, revealing performance gaps not captured by existing benchmarks.

The authors introduce a pipeline to generate clinically realistic HL7 FHIR R4 bundles from unstructured text and use it to create the MedCase-Structured dataset, producing valid FHIR for 82.5% of cases. They find that LLMs achieve lower diagnostic accuracy on structured FHIR inputs than on plain text, highlighting the need for deployment-aligned benchmarking.

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.
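To make the "terminology-grounded validation and repair" step more concrete, here is a minimal sketch of what such a check could look like. It is not the authors' implementation. The in-memory `TERMINOLOGY` table, the function names, and the repair-by-display-text rule are all assumptions made for illustration; a real pipeline would presumably query a terminology server and validate the full FHIR R4 structure.

```python
from typing import Dict, List, Tuple

SNOMED = "http://snomed.info/sct"

# Toy terminology table (assumption): (system, code) -> preferred display.
# A real pipeline would consult a terminology service instead.
TERMINOLOGY: Dict[Tuple[str, str], str] = {
    (SNOMED, "22298006"): "Myocardial infarction",
    (SNOMED, "38341003"): "Hypertensive disorder",
}

# Reverse index used for repair: (system, lowercased display) -> code.
DISPLAY_INDEX: Dict[Tuple[str, str], str] = {
    (system, display.lower()): code
    for (system, code), display in TERMINOLOGY.items()
}


def find_invalid_codings(bundle: dict) -> List[dict]:
    """Return every coding whose (system, code) pair is not in TERMINOLOGY."""
    invalid = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        for coding in resource.get("code", {}).get("coding", []):
            if (coding.get("system"), coding.get("code")) not in TERMINOLOGY:
                invalid.append(coding)
    return invalid


def repair_codings(bundle: dict) -> int:
    """Fix unknown codes in place when the display text matches a known term.

    Returns the number of codings repaired. Codings with no match are left
    untouched so a later stage can regenerate or reject them.
    """
    repaired = 0
    for coding in find_invalid_codings(bundle):
        key = (coding.get("system"), str(coding.get("display", "")).lower())
        if key in DISPLAY_INDEX:
            coding["code"] = DISPLAY_INDEX[key]
            repaired += 1
    return repaired


def make_example_bundle() -> dict:
    """Build a small hypothetical bundle with one hallucinated, one valid, and one unknown code."""
    def condition(code: str, display: str) -> dict:
        return {
            "resource": {
                "resourceType": "Condition",
                "subject": {"reference": "Patient/example"},
                "code": {"coding": [{"system": SNOMED, "code": code, "display": display}]},
            }
        }

    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [
            condition("99999999", "Myocardial infarction"),  # wrong code, known display
            condition("38341003", "Hypertensive disorder"),  # valid
            condition("12345", "Made-up syndrome"),          # not repairable here
        ],
    }
```

In this sketch, a bundle counts as valid only if `find_invalid_codings` returns an empty list after repair. How the actual pipeline treats cases that cannot be repaired is not stated in the abstract, beyond the reported 82.5% valid-generation rate.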
