CV | Mar 1

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

arXiv:2603.01301v1 | h-index: 5
Originality: Incremental advance
AI Analysis

This work clarifies the incremental role of RL in medical VLMs, aiding researchers in optimizing training pipelines for medical visual reasoning tasks.

The study investigates whether reinforcement learning (RL) improves medical vision-language models (VLMs) beyond supervised fine-tuning (SFT). It finds that RL mainly sharpens the output distribution, raising Acc@1 and sampling efficiency, but only when the model already has non-trivial support (high Pass@K). SFT is what expands that support, and in doing so makes RL effective.

Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
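
The contrast between Accuracy@1 and Pass@K in the abstract can be made concrete with a small sketch. The abstract does not say how the authors estimate Pass@K, so the code below assumes the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n answers are sampled per question and c of them are correct. The function names `pass_at_k` and `summarize` and the toy counts are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) for one question.

    n: number of sampled answers, c: how many were correct, k: sampling budget.
    Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(correct_counts: list[int], n: int, k: int) -> tuple[float, float, float]:
    """Return (Acc@1, Pass@K, gap) averaged over questions.

    Acc@1 is taken as the expected accuracy of a single sample (c / n),
    which is the same as pass_at_k with k = 1.
    """
    num_questions = len(correct_counts)
    acc1 = sum(c / n for c in correct_counts) / num_questions
    passk = sum(pass_at_k(n, c, k) for c in correct_counts) / num_questions
    return acc1, passk, passk - acc1


if __name__ == "__main__":
    # Hypothetical data: 3 questions, 4 samples each, with 0, 4, and 1 correct.
    acc1, passk, gap = summarize([0, 4, 1], n=4, k=2)
    print(f"Acc@1={acc1:.3f}  Pass@2={passk:.3f}  gap={gap:.3f}")
```

Under this reading, a large gap between Pass@K and Acc@1 means the correct answer is within the model's support but is sampled unreliably, which is the regime where the abstract reports that RL helps most. A question with zero correct samples contributes nothing to Pass@K at any K, which corresponds to the missing support that the abstract says SFT has to supply first.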

