Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure
For practitioners using frozen vision foundation models in medical imaging, this work provides evidence-based guidance and a low-regret feature-space selector for noisy-label learning, recasting the problem as regime-aware method selection.
The paper benchmarks eight noisy-label learning methods across 150 conditions in medical imaging with frozen vision foundation models and finds no universal winner: ELR wins the most conditions (49/150), while CUFIT achieves the best mean rank (2.51). The cost of method choice grows with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. The paper also shows that the small-loss assumption fails under frozen features, where clean and noisy loss distributions overlap by 53-61%, and that prediction agreement is more stable than loss ranking under asymmetric noise (3pp vs 13pp precision drop).
Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $\chi^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40\% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53--61\%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs.\ 13pp precision drop). On ISIC2019 with asymmetric 40\% noise, Co-Teaching reaches 68\% overall accuracy while collapsing to 35.1\% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
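The gap between overall and balanced accuracy reported for Co-Teaching can be illustrated with a minimal sketch. It assumes the standard definition of balanced accuracy as the unweighted mean of per-class recall. The labels below are hypothetical toy data, not drawn from ISIC2019 or any benchmark run, and the helper names (`accuracy`, `per_class_recall`, `balanced_accuracy`) are introduced here for illustration only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Counter returns 0 for classes with no correct predictions.
    return {c: hits[c] / n for c, n in support.items()}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical toy labels: one majority class (0) and three
# minority classes (1, 2, 3). Not data from the benchmark.
y_true = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)

print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
# prints: accuracy=0.70, balanced accuracy=0.25
```

In this toy case the classifier has zero recall on all three minority classes, yet overall accuracy stays at 0.70 because the majority class dominates the sample count. Balanced accuracy weights every class equally, which makes this kind of collapse visible.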