AI · Dec 11, 2025

CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment

arXiv:2512.10206v2 · 2 citations · Has Code
Originality: Incremental advance
AI Analysis

CP-Env addresses the need for better evaluation of medical AI agents in complex, real-world healthcare settings, though the advance is incremental because it builds on existing agentic environment frameworks.

The authors tackle the problem of evaluating large language models (LLMs) in dynamic clinical scenarios by introducing CP-Env, a controllable hospital environment that simulates end-to-end clinical pathways. They find that most models struggle with pathway complexity, hallucinating and losing diagnostic details.

Medical care follows complex clinical pathways that extend beyond isolated physician-patient encounters, emphasizing decision-making and transitions between stages. Current benchmarks, which focus on static exams or isolated dialogues, inadequately evaluate large language models (LLMs) in dynamic clinical scenarios. We introduce CP-Env, a controllable agentic hospital environment designed to evaluate LLMs across end-to-end clinical pathways. CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios for agent interaction that range from triage and specialist consultation to diagnostic testing and multidisciplinary team meetings. Following the adaptive flow of care in real hospitals, it supports branching, long-horizon task execution. We propose a three-tiered evaluation framework encompassing Clinical Efficacy, Process Competency, and Professional Ethics. Results reveal that most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Interestingly, excessive reasoning steps can sometimes prove counterproductive, while top models tend to depend less on tools because they have internalized the relevant knowledge. CP-Env advances the development of medical AI agents through comprehensive end-to-end clinical evaluation. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/CP_ENV.

