HyFormer-Net: A Synergistic CNN-Transformer with Interpretable Multi-Scale Fusion for Breast Lesion Segmentation and Classification in Ultrasound Images
This addresses the problem of improving diagnostic accuracy and interpretability for breast cancer detection in clinical settings, though it is incremental as it builds on existing hybrid architectures.
The paper tackled breast lesion segmentation and classification in ultrasound images by proposing HyFormer-Net, a hybrid CNN-Transformer model that achieved a Dice Score of 0.761 and accuracy of 93.2% on the BUSI dataset, with ensemble modeling further improving performance to 90.2% Dice and 99.5% accuracy.
B-mode ultrasound for breast cancer diagnosis faces challenges: speckle, operator dependency, and indistinct boundaries. Existing deep learning suffers from single-task learning, architectural constraints (CNNs lack global context, Transformers local features), and black-box decision-making. These gaps hinder clinical adoption. We propose HyFormer-Net, a hybrid CNN-Transformer for simultaneous segmentation and classification with intrinsic interpretability. Its dual-branch encoder integrates EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks. An attention-gated decoder provides precision and explainability. We introduce dual-pipeline interpretability: (1) intrinsic attention validation with quantitative IoU verification (mean: 0.86), and (2) Grad-CAM for classification reasoning. On the BUSI dataset, HyFormer-Net achieves Dice Score 0.761 +/- 0.072 and accuracy 93.2%, outperforming U-Net, Attention U-Net, and TransUNet. Malignant Recall of 92.1 +/- 2.2% ensures minimal false negatives. Ensemble modeling yields exceptional Dice 90.2%, accuracy 99.5%, and perfect 100% Malignant Recall, eliminating false negatives. Ablation studies confirm multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. Crucially, we conduct the first cross-dataset generalization study for hybrid CNN-Transformers in breast ultrasound. Zero-shot transfer fails (Dice: 0.058), confirming domain shift. However, progressive fine-tuning with only 10% target-domain data (68 images) recovers 92.5% performance. With 50% data, our model achieves 77.3% Dice, exceeding source-domain performance (76.1%) and demonstrating true generalization.