AI · LG · May 23

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

arXiv:2605.24699 · 64.4
Predicted impact: top 58% in AI · last 90 days · Originality: Incremental advance
AI Analysis

For clinical AI evaluation, this work demonstrates that agentic architecture can yield larger gains than prompt engineering, but also highlights grader variability as a key evaluation concern.

MDIA, a multi-agent diagnostic pipeline, achieves 0.6272 on HealthBench Professional using GPT-5.4, outperforming ChatGPT for Clinicians by 3.72 percentage points, with performance attributed to architectural design rather than prompt engineering.

Gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, and evaluate it on the full HealthBench Professional benchmark (n = 525) using a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is +3.72 pp above OpenAI's ChatGPT for Clinicians. Our experiments attribute the performance lift to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped by both the underlying foundation model and the orchestration architecture. However, we also observed notable differences when other models served as the grader: with Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.
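The arithmetic behind these figures can be made explicit with a minimal sketch. It uses only the numbers quoted in the abstract and reads "under GPT-5.4-2026-03-05" as naming the grader, which is an assumption. The variable and function names are illustrative and not taken from the paper, and the implied baseline is derived here rather than reported.

```python
from statistics import mean

# MDIA scores quoted in the abstract, keyed by grader model.
# Assumption: "under GPT-5.4-2026-03-05" refers to the grader.
mdia_scores = {
    "GPT-5.4-2026-03-05": 0.6272,
    "Gemini 2.5 Pro": 0.6585,
}

# Reported lift over ChatGPT for Clinicians, in percentage points.
reported_lift_pp = 3.72


def to_pp(delta: float) -> float:
    """Convert a difference between two 0-1 scores into percentage points."""
    return round(delta * 100, 2)


# Baseline score implied by the reported lift (derived, not stated in the abstract).
implied_baseline = round(
    mdia_scores["GPT-5.4-2026-03-05"] - reported_lift_pp / 100, 4
)

# Spread of MDIA's own score across the two graders.
grader_spread_pp = to_pp(max(mdia_scores.values()) - min(mdia_scores.values()))

# Simple cross-grader average, one possible way to summarise several graders.
mean_score = round(mean(mdia_scores.values()), 3)

print(f"implied baseline:  {implied_baseline:.4f}")
print(f"grader spread:     {grader_spread_pp:.2f} pp")
print(f"mean over graders: {mean_score:.3f}")
```

Under these assumptions the spread between graders (3.13 pp) is of the same order as the reported lift (3.72 pp). The abstract does not give the baseline's score under Gemini 2.5 Pro, so the lift under that grader cannot be computed from the figures here.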

