Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography
This work addresses the subjectivity of breast density assessment, which radiologists use in estimating cancer risk. Its contribution is incremental, however, because it compares existing methods on medical imaging data.
The study compared multimodal and CNN-based methods for automated breast density classification in mammography, finding that a fine-tuned ConvNeXt model outperformed BioMedCLIP linear probing, with zero-shot classification showing modest performance.
Mammographic breast density classification is essential for cancer risk assessment but remains challenging due to subjective interpretation and inter-observer variability. This study compares multimodal and CNN-based methods for automated classification using the BI-RADS system, evaluating BioMedCLIP and ConvNeXt across three learning scenarios: zero-shot classification, linear probing with textual descriptions, and fine-tuning with numerical labels. Results show that zero-shot classification achieved modest performance, while the fine-tuned ConvNeXt model outperformed the BioMedCLIP linear probe. Although linear probing demonstrated potential with pretrained embeddings, it was less effective than full fine-tuning. These findings suggest that despite the promise of multimodal learning, CNN-based models with end-to-end fine-tuning provide stronger performance for specialized medical imaging. The study underscores the need for more detailed textual representations and domain-specific adaptations in future radiology applications.
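To make the difference between the first two learning scenarios concrete, the following is a minimal sketch, not the authors' implementation. It assumes the four BI-RADS density categories (A to D) and uses small synthetic vectors in place of real BioMedCLIP embeddings. The function names (`zero_shot_predict`, `fit_linear_probe`, `probe_predict`) and all hyperparameters are hypothetical and chosen for illustration only. Zero-shot classification assigns each image to the class whose text-prompt embedding is most similar, with no training; linear probing trains only a linear classifier on top of frozen image embeddings. Full fine-tuning, which updates the backbone weights as well, is not sketched here.

```python
import numpy as np

# Assumption: four BI-RADS density categories, indexed 0..3.
BIRADS_CLASSES = ["A", "B", "C", "D"]


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_predict(image_emb, text_emb):
    """Zero-shot: pick the class whose text embedding is most cosine-similar.

    image_emb: (n_images, d) frozen image embeddings
    text_emb:  (n_classes, d) embeddings of one text prompt per class
    """
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)


def fit_linear_probe(feats, labels, n_classes, lr=0.5, epochs=200):
    """Linear probe: softmax regression trained on frozen features.

    Only W and b are learned; the encoder that produced `feats` is untouched.
    """
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (feats.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def probe_predict(feats, W, b):
    """Predict class indices with a trained linear probe."""
    return (feats @ W + b).argmax(axis=1)


if __name__ == "__main__":
    # Synthetic stand-ins for embeddings; a real pipeline would obtain these
    # from a pretrained image/text encoder.
    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(len(BIRADS_CLASSES), 16))
    labels = rng.integers(0, len(BIRADS_CLASSES), size=200)
    image_emb = text_emb[labels] + 0.1 * rng.normal(size=(200, 16))

    zs = zero_shot_predict(image_emb, text_emb)
    W, b = fit_linear_probe(image_emb, labels, len(BIRADS_CLASSES))
    lp = probe_predict(image_emb, W, b)
    print("zero-shot accuracy on synthetic data:", (zs == labels).mean())
    print("linear-probe accuracy on synthetic data:", (lp == labels).mean())
```

The accuracies printed by this script describe the synthetic data only and say nothing about the study's results. The sketch shows where the scenarios differ: zero-shot depends entirely on how well the text prompts separate the classes in embedding space, which is consistent with the abstract's call for more detailed textual representations, whereas the linear probe learns a decision boundary from labeled examples but cannot adapt the underlying features.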