CV | Jun 20, 2025

Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation

Julio Silva-Rodríguez, Fereshteh Shakeri, Houda Bahig, Jose Dolz, Ismail Ben Ayed

arXiv:2506.17500v1 | 3 citations | h-index: 50 | MICCAI

Originality: Incremental advance

AI Analysis

This work addresses the problem of deploying medical VLMs in real-world, imbalanced scenarios without validation data, offering a more practical solution for healthcare applications.

The paper tackles the unrealistic assumptions of balanced support sets and validation data in few-shot adaptation of vision-language models for medical imaging, showing that current methods often fail under realistic conditions. It introduces a training-free linear probe that adaptively blends visual and textual supervision, achieving robust adaptation across various modalities and tasks.

Vision-language models (VLMs) are gaining attention in medical image analysis. These are pre-trained on large, heterogeneous data sources, yielding rich and transferable representations. Notably, the combination of modality-specialized VLMs with few-shot adaptation has provided fruitful results, enabling the efficient deployment of high-performing solutions. However, previous works on this topic make strong assumptions about the distribution of adaptation data, which are unrealistic in the medical domain. First, prior art assumes access to a balanced support set, a condition that breaks the natural imbalance in disease prevalence found in real-world scenarios. Second, these works typically assume the presence of an additional validation set to fix critical hyper-parameters, which is highly data-inefficient. This work challenges these favorable deployment scenarios and introduces a realistic, imbalanced, validation-free adaptation setting. Our extensive benchmark across various modalities and downstream tasks demonstrates that current methods systematically compromise their performance when operating under realistic conditions, occasionally even performing worse than zero-shot inference. Also, we introduce a training-free linear probe that adaptively blends visual and textual supervision. Detailed studies demonstrate that the proposed solver is a strong, efficient baseline, enabling robust adaptation in challenging scenarios.
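To make the idea of a training-free probe that blends visual and textual supervision more concrete, here is a minimal sketch. It is not the authors' solver: the function names, the count-based blending rule `lam = n_c / (n_c + tau)`, and the `tau` parameter are all assumptions chosen for illustration. The sketch builds one weight vector per class by mixing the mean of that class's few-shot image features with the class's text embedding, leaning on text when a class has few or no shots, as can happen in an imbalanced support set.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def blended_prototypes(support_feats, support_labels, text_embeds, tau=1.0):
    """Build class weights without gradient-based training.

    Hypothetical blending rule (an assumption, not taken from the paper):
    the visual prototype of class c gets weight n_c / (n_c + tau), where
    n_c is the number of support samples of class c. Classes with no
    samples fall back to their text embedding.
    """
    support_labels = np.asarray(support_labels)
    text = l2_normalize(np.asarray(text_embeds, dtype=float))
    feats = l2_normalize(np.asarray(support_feats, dtype=float))
    num_classes = text.shape[0]

    weights = np.empty_like(text)
    for c in range(num_classes):
        mask = support_labels == c
        n_c = int(mask.sum())
        if n_c == 0:
            # No visual evidence for this class: use text supervision only.
            weights[c] = text[c]
            continue
        visual_mean = l2_normalize(feats[mask].mean(axis=0))
        lam = n_c / (n_c + tau)  # more shots -> trust the visual prototype more
        weights[c] = lam * visual_mean + (1.0 - lam) * text[c]
    return l2_normalize(weights)


def predict(query_feats, class_weights):
    """Assign each query to the class with the highest cosine similarity."""
    logits = l2_normalize(np.asarray(query_feats, dtype=float)) @ class_weights.T
    return logits.argmax(axis=1)
```

Because the blending weight depends only on per-class sample counts, this sketch needs no validation set to tune it, which is the property the realistic setting described above calls for. Any real deployment would take the features and text embeddings from a pre-trained medical VLM; here they are plain arrays.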
