CV, LG | May 4

Ultrasound Vision-Language Alignment via Contrastive Learning

Zhuoyang Lyu, Yiyang Zhang, Tongxin Wang, Ruirui Lan

arXiv:2605.02126 | 7.6

AI Analysis

This work enables zero-shot and few-shot transfer for ultrasound analysis, addressing the scarcity of task-specific annotations, though results are dataset-dependent and incremental over existing CLIP baselines.

EchoCare-CLIP, a CLIP-style contrastive framework, aligns ultrasound images with clinical text using over 16K image-text pairs. The best configuration achieves a paired alignment score of 0.682, and CLIP-based variants with partial fine-tuning reach zero-shot classification scores of 0.709 on BUSI and 0.626 on AULI. Full end-to-end fine-tuning, by contrast, degrades transfer due to overfitting.

Ultrasound foundation models have achieved strong performance on structured prediction tasks but remain exclusively vision-based, limiting zero-shot and few-shot transfer to novel tasks where task-specific annotation is scarce. We address this gap with EchoCare-CLIP, a CLIP-style dual-encoder contrastive framework that aligns ultrasound images with clinical text in a shared embedding space. We curate a multi-organ corpus of over 16K image-text pairs spanning breast, liver, lung, and thyroid, with over 78% of captions derived from expert-annotated reports, and complement the remainder with a three-tier template-based and LLM-based caption generation pipeline. We evaluate model configurations spanning two text encoder families (CLIP, BioClinicalBERT) and two caption strategies (template-based, LLM-generated) against OpenAI CLIP and BiomedCLIP baselines. Our trained models consistently improve cross-modal alignment over baselines, with the best configuration achieving a paired alignment score of 0.682. However, stronger alignment does not guarantee better downstream performance: CLIP-based variants with partial fine-tuning achieve the strongest zero-shot classification on external held-out datasets (0.709 on BUSI; 0.626 on AULI), while full end-to-end fine-tuning degrades transfer due to overfitting. On linear probing and few-shot adaptation, model rankings are dataset-dependent, reflecting a trade-off between domain adaptation and representational generalizability. We further show that template-based captions match or outperform LLM-generated captions, suggesting lexical diversity is not a proxy for caption quality. Taken together, our results demonstrate that ultrasound vision-language alignment is achievable from public data alone, but robust clinical transfer requires careful balancing of domain adaptation, encoder capacity, and caption supervision quality.
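To make the "CLIP-style dual-encoder contrastive" objective and the zero-shot evaluation concrete, here is a minimal NumPy sketch of the standard symmetric contrastive loss and of prompt-based zero-shot classification. It is an illustration under assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the use of precomputed embeddings in place of real image and text encoders are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb are assumed to be a matched
    image-caption pair; every other row in the batch acts as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape (batch, batch)
    diag = np.arange(logits.shape[0])
    # Image-to-text: each image should pick out its own caption (softmax over columns).
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each caption should pick out its own image (softmax over rows).
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_predict(image_emb, class_text_emb):
    """Assign each image to the class whose text-prompt embedding is most similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

In this sketch, zero-shot classification needs no task-specific labels: each class is represented only by the embedding of a text prompt, which is why alignment quality in the shared space matters for transfer. The paired alignment score reported in the abstract is not defined here, so the sketch does not attempt to reproduce it.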
