CL | Sep 29, 2025

The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability

Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu

arXiv:2509.24958v2 | 12.07 citations | h-index: 4

Originality: Incremental advance

AI Analysis

This paper addresses the need for comprehensive evaluation of medical AI agents' inquiry skills. The advance is incremental: it builds on existing diagnostic AI work by focusing on overlooked qualities such as communication.

The authors evaluate AI doctor agents' questioning capabilities by introducing MAQuE, a benchmark with 3,000 simulated patient agents. They find that even state-of-the-art models struggle, with diagnostic accuracy highly sensitive to variations in patient behavior.

An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.

