CV · May 8

Benchmarking Foundation Models for Renal Lesion Stratification in CT

Hartmut Häntze, Sarah de Boer, Myrthe Buser, Alessa Hering, Bram van Ginneken, Mathias Prokop, Jawed Nawabi, Sebastian Ziegelmayer, Lisa Adams, Keno Bressem

arXiv:2605.07749 · Has Code

AI Analysis

For researchers and clinicians working on renal lesion classification, this benchmark shows that current medical foundation models are not yet competitive with traditional radiomics, highlighting a gap in representation learning for fine-grained histopathological discrimination.

The authors benchmarked three medical foundation models (FMs) on CT-based renal lesion classification and found they matched a from-scratch 3D ResNet-50 (AUC 0.70-0.77 vs. 0.72) but were significantly outperformed by a handcrafted radiomics classifier (AUC 0.88, p≤0.002), indicating that current FMs do not capture fine-grained texture and shape features needed for this task.

The rapid proliferation of open-source medical foundation models (FMs) raises a practical question: how well do their pre-trained representations transfer to clinically relevant but data-scarce classification tasks? CT-based renal lesion classification is one such task: the field is constrained by inherently limited training data, so more generalizable representations would be especially valuable. We addressed this question by benchmarking three medical FMs on this task, a six-class problem that spans common entities such as cysts and clear cell renal cell carcinoma alongside rare subtypes. Using a frozen feature-probing protocol, we compared FM embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. Models were trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Our results reveal two key findings. First, FM performance (AUC 0.70-0.77) matched the from-scratch ResNet (AUC 0.72) while drastically reducing hardware demand: once features were extracted, training the probe required only seconds on a CPU. Second, the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88 (all p ≤ 0.002). This suggests that current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity that drives histological subtype discrimination. Despite their potential in data-scarce settings, medical FMs did not surpass established models for renal lesion stratification, leaving radiomics as the current state-of-the-art.
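The frozen feature-probing protocol described in the abstract can be illustrated with a minimal sketch. The abstract does not state which probe classifier or which multi-class AUC averaging was used, so everything here is an assumption: a softmax (multinomial logistic regression) probe, macro-averaged one-vs-rest AUC, and synthetic 8-dimensional vectors standing in for real FM embeddings. All function names are hypothetical and are not taken from the authors' code.

```python
import math
import random

N_CLASSES = 6  # six lesion classes, as in the abstract
DIM = 8        # hypothetical embedding size; real FM embeddings are far larger


def sample_embeddings(n_per_class, rng):
    """Synthetic stand-in for frozen FM features: class k is shifted along axis k."""
    X, y = [], []
    for k in range(N_CLASSES):
        for _ in range(n_per_class):
            X.append([rng.gauss(3.0 if j == k else 0.0, 1.0) for j in range(DIM)])
            y.append(k)
    return X, y


def predict_proba(W, b, x):
    """Softmax over linear logits, shifted by the max logit for stability."""
    logits = [sum(wj * xj for wj, xj in zip(w, x)) + bk for w, bk in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax_probe(X, y, lr=0.5, epochs=100):
    """Full-batch gradient descent on cross-entropy; the embeddings never change."""
    W = [[0.0] * DIM for _ in range(N_CLASSES)]
    b = [0.0] * N_CLASSES
    n = len(X)
    for _ in range(epochs):
        gW = [[0.0] * DIM for _ in range(N_CLASSES)]
        gb = [0.0] * N_CLASSES
        for x, label in zip(X, y):
            p = predict_proba(W, b, x)
            for k in range(N_CLASSES):
                err = p[k] - (1.0 if k == label else 0.0)
                gb[k] += err
                for j in range(DIM):
                    gW[k][j] += err * x[j]
        for k in range(N_CLASSES):
            b[k] -= lr * gb[k] / n
            for j in range(DIM):
                W[k][j] -= lr * gW[k][j] / n
    return W, b


def binary_auc(scores, positives):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, positives) if t]
    neg = [s for s, t in zip(scores, positives) if not t]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(probs, y):
    """Unweighted mean of the per-class one-vs-rest AUCs."""
    per_class = [
        binary_auc([p[k] for p in probs], [t == k for t in y])
        for k in range(N_CLASSES)
    ]
    return sum(per_class) / N_CLASSES


rng = random.Random(0)
X_train, y_train = sample_embeddings(30, rng)
X_test, y_test = sample_embeddings(20, rng)
W, b = train_softmax_probe(X_train, y_train)
probs = [predict_proba(W, b, x) for x in X_test]
auc = macro_ovr_auc(probs, y_test)
print(f"macro one-vs-rest AUC on synthetic data: {auc:.3f}")
```

Because only the small probe is trained, the whole fit runs in seconds on a CPU, which is the hardware advantage the abstract attributes to the FM approach. The AUC printed here reflects the deliberately easy synthetic data and says nothing about the paper's results.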
