CV, Feb 26

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

arXiv:2602.22959v1, 2.81 citations, h-index: 13

Originality: Incremental advance

AI Analysis

This paper addresses the challenging problem of distinguishing visually hard-to-separate diseases for medical diagnosis, an underexplored but clinically significant area for agent-based systems.

This study investigates the ability of agent-based systems to distinguish visually similar diseases in a zero-shot setting, focusing on melanoma vs. atypical nevus and pulmonary edema vs. pneumonia. The proposed multi-agent framework with contrastive adjudication achieved an 11-percentage-point gain in accuracy on dermoscopy data, though overall performance is not yet sufficient for clinical use.

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.
