AI · May 11

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

arXiv:2605.10286 · 83.3

Predicted impact: top 30% in AI · last 90 days · Originality: Synthesis-oriented

AI Analysis

For healthcare AI researchers, this work provides a benchmark and highlights the need for improved multi-agent collaboration in clinical decision support.

This paper evaluates LLM-based agents for multimodal clinical prediction tasks using real-world data, finding that single-agent frameworks outperform naive multi-agent systems in handling heterogeneous inputs and calibration.

Building effective clinical decision support systems requires the synthesis of complex, heterogeneous multimodal data. Such modalities include temporal electronic health record data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single-agent and multi-agent systems. Our findings highlight that single-agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.
