MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models
This work addresses the development of clinically aligned vision-language models for medical imaging. It is an incremental advance: a new method applied to a known bottleneck.
The paper tackles the challenge of bridging clinical diagnostic reasoning with AI in medical imaging. It introduces MedCLM, an automated pipeline that converts detection datasets into medical VQA data with Chain-of-Thought reasoning, and reports state-of-the-art performance on several medical VQA benchmarks.
Bridging clinical diagnostic reasoning with AI remains a central challenge in medical imaging. We introduce MedCLM, an automated pipeline that converts detection datasets into large-scale medical visual question answering (VQA) data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ segmentation and structured rationales. These contextual signals enable medical vision-language models to generate question-answer pairs with step-by-step reasoning. To utilize this data effectively, we propose an Integrated CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes for visual grounding, a Medium stage that encourages implicit localization, and a Hard stage for weakly supervised reasoning. Experimental results demonstrate that MedCLM attains state-of-the-art performance on several medical VQA benchmarks, providing a scalable framework for developing clinically aligned medical vision-language models.
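The two mechanisms named in the abstract, linking a lesion box to an organ segmentation and varying supervision across curriculum stages, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`link_lesion_to_organ`, `build_sample`), the majority-overlap rule, the prompt format, and the choice of what each stage exposes are all assumptions made for illustration; the abstract only states that the Easy stage uses explicit lesion boxes, the Medium stage encourages implicit localization, and the Hard stage is weakly supervised.

```python
from typing import Dict, Optional, Tuple

import numpy as np

# Hypothetical box convention: (x_min, y_min, x_max, y_max), max edges exclusive.
Box = Tuple[int, int, int, int]


def link_lesion_to_organ(
    box: Box, organ_mask: np.ndarray, organ_names: Dict[int, str]
) -> Optional[str]:
    """Return the organ whose segmentation covers most of the lesion box.

    Assumption: organ_mask is a 2D integer label map where 0 is background.
    The majority-overlap rule is an illustrative choice, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    region = organ_mask[y0:y1, x0:x1]
    labels, counts = np.unique(region[region > 0], return_counts=True)
    if labels.size == 0:
        return None  # the box touches no segmented organ
    return organ_names.get(int(labels[np.argmax(counts)]))


def build_sample(
    stage: str, question: str, answer: str, rationale: str, box: Box
) -> Dict[str, str]:
    """Build one training sample for a curriculum stage (assumed formats)."""
    if stage == "easy":
        # Explicit lesion box in the prompt for visual grounding.
        prompt = f"{question} Lesion box: {list(box)}"
        return {"prompt": prompt, "target": f"{rationale} Answer: {answer}"}
    if stage == "medium":
        # Box withheld; the model must localize implicitly while reasoning.
        return {"prompt": question, "target": f"{rationale} Answer: {answer}"}
    if stage == "hard":
        # Assumed weak supervision: only the final answer is supervised.
        return {"prompt": question, "target": f"Answer: {answer}"}
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    mask = np.zeros((8, 8), dtype=int)
    mask[0:4, :] = 1  # toy "liver" region
    mask[4:8, :] = 2  # toy "kidney" region
    organ = link_lesion_to_organ((2, 2, 6, 5), mask, {1: "liver", 2: "kidney"})
    sample = build_sample(
        "easy",
        "Is there a lesion?",
        "yes",
        f"A focal abnormality lies within the {organ}.",
        (2, 2, 6, 5),
    )
    print(organ, sample)
```

In this sketch the same annotated case yields three training samples of decreasing supervision, which is one plausible way to realize an easy-to-hard curriculum over data derived from detection boxes.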