CV | AI | Oct 1, 2025

Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis

arXiv:2510.00411v3 | 12 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the challenge of leveraging zero-shot vision-language models for medical imaging diagnosis, showing that calibration can make them competitive with supervised models, which is incremental but practical for healthcare applications.

This paper tackles the problem of accurately interpreting chest radiographs for medical diagnosis by comparing a supervised lightweight CNN with a zero-shot medical VLM, BiomedCLIP, on pneumonia and tuberculosis detection tasks. The results show that after decision threshold calibration, the VLM achieves an F1-score of 0.8841 for pneumonia detection, surpassing the CNN's 0.8803. For tuberculosis detection, calibration raises the VLM's F1-score from 0.4812 to 0.7684, close to the CNN's 0.7834.

The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.
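
The calibration step described in the abstract can be illustrated with a minimal sketch. It assumes the zero-shot model produces one positive-class score per image, and that the threshold is chosen to maximize F1 on the validation set. The abstract does not state the selection criterion, so F1 is an assumption here. The function names `f1_score` and `calibrate_threshold` and the scores below are hypothetical and synthetic; they are not the paper's code or data.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    if tp == 0:
        return 0.0
    return float(2 * tp / (2 * tp + fp + fn))

def calibrate_threshold(val_scores, val_labels):
    """Try every observed score as a threshold; keep the one with the best validation F1."""
    val_scores = np.asarray(val_scores, dtype=float)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(val_scores):
        f1 = f1_score(val_labels, val_scores >= t)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Synthetic validation scores (assumption): positives score higher than
# negatives on average, but every score sits below the default cut of 0.5.
val_scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
val_labels = [0, 0, 1, 0, 1, 1]

threshold, val_f1 = calibrate_threshold(val_scores, val_labels)
default_f1 = f1_score(val_labels, np.asarray(val_scores) >= 0.5)
print(f"default F1 = {default_f1:.3f}, calibrated F1 = {val_f1:.3f} at t = {threshold:.2f}")
```

In this toy case the default threshold of 0.5 labels every image negative, while the calibrated threshold recovers all positives at the cost of one false positive. The chosen threshold would then be applied unchanged to the held-out test set, so the reported test score is not tuned on test data.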

