Benchmarking Ultrasound Foundation Models for Fetal Plane Classification

Leya Barrientos, Yuexi Du, Nicha C. Dvornek

arXiv:2605.27796

AI Analysis

For researchers and practitioners in fetal ultrasound analysis, this benchmark provides guidance on selecting pretrained models for classification tasks with limited labeled data.

This paper benchmarks ultrasound foundation models for fetal plane classification. FetalCLIP achieves the best linear probing results (F1 = 0.9261 in-domain, 0.9731 out-of-domain) and USFM the best full fine-tuning results (F1 = 0.9476 in-domain, 0.9515 out-of-domain), while MOFO and UltraSAM underperform.

Ultrasound is widely used in obstetric care due to its safety, accessibility, and real-time imaging. However, interpretation remains operator-dependent and susceptible to noise and artifacts. Deep learning models have shown strong performance in addressing these problems, but they typically require large annotated datasets that are difficult to obtain in clinical ultrasound. Foundation models (FMs) offer an alternative: they are pretrained on a large number of ultrasound images to learn transferable representations that can generalize with limited labeled data. This work presents a comprehensive benchmark of ultrasound-specific FMs for fetal plane classification. We evaluated four ultrasound FMs (USFM, MOFO, UltraSAM, FetalCLIP) against two CNN baselines (ResNet50, EfficientNet-V2) and a ViT (DINOv3) pretrained on natural images. We trained all models under two complementary settings: full fine-tuning and linear probing with a frozen encoder. All models were trained using 5-fold patient-level cross-validation on a Spanish fetal ultrasound dataset and tested on both in-domain data and an external African cohort to assess cross-population generalization. We found that FetalCLIP achieved the best results in the linear probing setting (F1 = 0.9261 in-domain, F1 = 0.9731 out-of-domain), while USFM performed best in the full fine-tuning setting (F1 = 0.9476 in-domain, F1 = 0.9515 out-of-domain). MOFO and UltraSAM degraded most in both settings, in some cases underperforming models pretrained on natural images. These findings show that the choice of pretrained model strongly affects fetal plane classification performance, since different pretraining objectives lead to different levels of transferability.
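
The abstract mentions 5-fold patient-level cross-validation but does not describe how the folds were built. The sketch below is one plausible reading, not the authors' code: patients, rather than individual images, are assigned to folds, so no patient's images appear in both the training and validation split of the same fold. The function name `patient_level_folds`, the round-robin assignment, the seed, and the toy patient IDs are all assumptions made for illustration. The sketch does not stratify by plane class, which the actual study may or may not have done.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Return a list of (train_indices, val_indices) pairs, one per fold.

    patient_ids[i] is the patient that image i belongs to. Patients, not
    images, are assigned to folds, so a patient's images are never split
    between training and validation within a fold.
    """
    # Collect the image indices belonging to each patient.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    patients = sorted(by_patient)
    if len(patients) < n_folds:
        raise ValueError("need at least as many patients as folds")

    # Shuffle patients reproducibly, then deal them round-robin into folds.
    random.Random(seed).shuffle(patients)
    fold_patients = [patients[k::n_folds] for k in range(n_folds)]

    folds = []
    for k in range(n_folds):
        # Validation set: every image of the patients held out in fold k.
        val = sorted(i for pid in fold_patients[k] for i in by_patient[pid])
        # Training set: every image of the patients in the other folds.
        train = sorted(
            i
            for j in range(n_folds)
            if j != k
            for pid in fold_patients[j]
            for i in by_patient[pid]
        )
        folds.append((train, val))
    return folds


if __name__ == "__main__":
    # Toy data (hypothetical): 10 patients with 3 images each.
    ids = [f"patient_{p}" for p in range(10) for _ in range(3)]
    for k, (train, val) in enumerate(patient_level_folds(ids)):
        train_patients = {ids[i] for i in train}
        val_patients = {ids[i] for i in val}
        assert train_patients.isdisjoint(val_patients)
        print(f"fold {k}: {len(train)} train images, {len(val)} val images")
```

Splitting by patient matters because several images from the same examination tend to look alike. An image-level split could place near-duplicates on both sides of the train/validation boundary and inflate in-domain scores.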
