3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark
This work addresses the evaluation of LVLMs in realistic medical consultations, a problem relevant to both researchers and practitioners. Its contribution is incremental, since it builds on existing LVLM and benchmark methods.
The paper tackles the underexplored ability of Large Vision-Language Models (LVLMs) to conduct complex telemedicine consultations. It introduces 3MDBench, a benchmark for simulating and evaluating LVLM-driven dialogues, and shows that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings and that integrating a diagnostic CNN boosts F1 by up to 20%.
Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations that combine accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through a temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via an Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open- and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
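To make the idea of injecting classifier predictions into the LVLM's context more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the diagnostic CNN outputs a probability per diagnosis and that the top-k candidates are passed to the model as plain text. The function names, the prompt wording, and the example diagnoses are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_predictions(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable diagnoses, highest probability first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def build_context(complaint: str, probs: Dict[str, float], k: int = 3) -> str:
    """Compose a text context that includes the classifier's top-k candidates.

    The prompt wording here is an assumption made for illustration only.
    """
    lines = [f"- {name}: {p:.2f}" for name, p in top_k_predictions(probs, k)]
    return (
        "Patient complaint: " + complaint + "\n"
        "An image classifier suggests these candidate diagnoses:\n"
        + "\n".join(lines) + "\n"
        "Treat these as hints, ask follow-up questions, then give a final diagnosis."
    )


# Hypothetical classifier output for a single image.
example_probs = {"eczema": 0.62, "psoriasis": 0.25, "acne": 0.08, "urticaria": 0.05}
print(build_context("itchy rash on forearm", example_probs, k=2))
```

Passing only the top-k candidates keeps the added context short and leaves the final decision to the dialogue model, which can still ask the patient clarifying questions before committing to a diagnosis.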