Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
This work addresses the challenge of accurate, explainable AI-assisted diagnosis in telemedicine. Its contribution is incremental: it builds on existing models with architectural enhancements rather than introducing new ones.
The study addresses limited context in telemedicine dermatological care by testing vision-language models on medical visual question answering. It finds that clinically inspired multi-agent reasoning and retrieval-augmented architectures achieved up to 70% accuracy while maintaining performance on unseen data.
Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain-specific fine-tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. The study tested seven vision-language models on medical visual question answering across six configurations: baseline models and fine-tuned variants, each evaluated on its own and augmented with either a reasoning layer that combines multiple model perspectives, analogous to peer consultation, or retrieval-augmented generation that incorporates medical literature at inference time, analogous to reference-checking. Fine-tuning degraded performance in four of seven models, with an average 30% decrease, and baseline models collapsed on test data. Clinically inspired architectures, by contrast, achieved up to 70% accuracy, maintaining performance on unseen data while generating explainable, literature-grounded outputs critical for clinical adoption. These findings suggest that medical AI succeeds by reconstructing the collaborative and evidence-based practices fundamental to clinical diagnosis.
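To make the two augmentations concrete, the following is a minimal sketch, not the study's implementation. It assumes the reasoning layer can be approximated by a majority vote over several models' answers and that retrieval can be approximated by word-overlap ranking over a small passage list. All names (`Agent`, `retrieve`, `consult`, the toy agents and corpus) are hypothetical stand-ins for illustration; the actual models, prompts, and aggregation strategy are not specified here.

```python
from collections import Counter
from itertools import product
from typing import Callable, List, Sequence

# Hypothetical type: an "agent" maps (question, retrieved context) to an answer.
Agent = Callable[[str, str], str]


def retrieve(question: str, corpus: Sequence[str], k: int = 1) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return list(ranked[:k])


def consult(agents: Sequence[Agent], question: str, context: str = "") -> str:
    """Stand-in for a reasoning layer (assumption): majority vote over agent answers."""
    votes = Counter(agent(question, context) for agent in agents)
    return votes.most_common(1)[0][0]


# The six configurations: two model states crossed with three augmentation options.
CONFIGURATIONS = list(
    product(["baseline", "fine-tuned"], ["none", "reasoning layer", "RAG"])
)

# Toy data and agents, purely illustrative.
corpus = [
    "psoriasis presents with scaly plaques on elbows",
    "eczema presents with itchy rash in skin folds",
]
agents: List[Agent] = [
    lambda q, c: "eczema" if "eczema" in c else "unknown",
    lambda q, c: "psoriasis",
    lambda q, c: "eczema" if "itchy" in q else "psoriasis",
]

question = "itchy rash in skin folds"
context = " ".join(retrieve(question, corpus))
answer = consult(agents, question, context)
```

In this toy setup one agent relies on the retrieved passage, one ignores all input, and one relies on the question text, so the vote illustrates how a single unreliable perspective can be outvoted. A real system would replace the lambdas with vision-language model calls and the overlap ranking with a proper retriever over medical literature.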