Julio Silva-Rodríguez

CV
h-index: 50
24 papers
582 citations
Novelty: 51%
AI Score: 56

24 Papers

CV | Sep 5, 2024 | Code
Few-shot Adaptation of Medical Vision-Language Models

Fereshteh Shakeri, Yunshi Huang, Julio Silva-Rodríguez et al.

Integrating image and text data through multi-modal learning has emerged as a new approach in medical imaging research, following its successful deployment in computer vision. While considerable efforts have been dedicated to establishing medical foundation models and their zero-shot transfer to downstream tasks, the popular few-shot setting remains relatively unexplored. Following on from the currently strong emergence of this setting in computer vision, we introduce the first structured benchmark for adapting medical vision-language models (VLMs) in a strict few-shot regime and investigate various adaptation strategies commonly used in the context of natural images. Furthermore, we evaluate a simple generalization of the linear-probe adaptation baseline, which seeks an optimal blending of the visual prototypes and text embeddings via learnable class-wise multipliers. Surprisingly, such a text-informed linear probe yields competitive performances in comparison to convoluted prompt-learning and adapter-based strategies, while running considerably faster and accommodating the black-box setting. Our extensive experiments span three different medical modalities and specialized foundation models, nine downstream tasks, and several state-of-the-art few-shot adaptation methods. We made our benchmark and code publicly available to trigger further developments in this emergent subject: https://github.com/FereshteShakeri/few-shot-MedVLMs.

CV | Aug 15, 2023
A Foundation Language-Image Model of the Retina (FLAIR): Encoding Expert Knowledge in Text Supervision

Julio Silva-Rodríguez, Hadi Chakor, Riadh Kobbi et al.

Foundation vision-language models are currently transforming computer vision, and are on the rise in medical imaging fueled by their very promising generalization capabilities. However, the initial attempts to transfer this new paradigm to medical imaging have shown less impressive performances than those observed in other domains, due to the significant domain shift and the complex, expert domain knowledge inherent to medical-imaging tasks. Motivated by the need for domain-expert foundation models, we present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. To this end, we compiled 38 open-access, mostly categorical fundus imaging datasets from various sources, with up to 101 different target conditions and 288,307 images. We integrate the expert's domain knowledge in the form of descriptive textual prompts, during both pre-training and zero-shot inference, enhancing the less-informative categorical supervision of the data. Such a textual expert's knowledge, which we compiled from the relevant clinical literature and community standards, describes the fine-grained features of the pathologies as well as the hierarchies and dependencies between them. We report comprehensive evaluations, which illustrate the benefit of integrating expert knowledge and the strong generalization capabilities of FLAIR under difficult scenarios with domain shifts or unseen categories. When adapted with a lightweight linear probe, FLAIR outperforms fully-trained, dataset-focused models, more so in the few-shot regimes. Interestingly, FLAIR outperforms by a wide margin larger-scale generalist image-language models and retina domain-specific self-supervised networks, which emphasizes the potential of embedding experts' domain knowledge and the limitations of generalist models in medical imaging.

CV | Mar 29, 2023
Towards Foundation Models and Few-Shot Parameter-Efficient Fine-Tuning for Volumetric Organ Segmentation

Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed

The recent popularity of foundation models and the pre-train-and-adapt paradigm, where a large-scale model is transferred to downstream tasks, is gaining attention for volumetric medical image segmentation. However, current transfer learning strategies devoted to full fine-tuning may require significant resources and yield sub-optimal results when the labeled data of the target task is scarce. This makes their applicability in real clinical settings challenging, since these institutions are usually constrained on data and computational resources to develop proprietary solutions. To address this challenge, we formalize Few-Shot Efficient Fine-Tuning (FSEFT), a novel and realistic scenario for adapting medical image segmentation foundation models. This setting considers the key role of both data- and parameter-efficiency during adaptation. Building on a foundation model pre-trained on open-access CT organ segmentation sources, we propose leveraging Parameter-Efficient Fine-Tuning and black-box Adapters to address such challenges. Furthermore, novel efficient adaptation methodologies are introduced in this work, which include Spatial black-box Adapters that are more appropriate for dense prediction tasks and constrained transductive inference, leveraging task-specific prior knowledge. Our comprehensive transfer learning experiments confirm the suitability of foundation models in medical image segmentation and unveil the limitations of popular fine-tuning strategies in few-shot scenarios.

IV | Mar 3, 2022
Constrained unsupervised anomaly segmentation

Julio Silva-Rodríguez, Valery Naranjo, Jose Dolz

Current unsupervised anomaly localization approaches rely on generative models to learn the distribution of normal images, which is later used to identify potential anomalous regions derived from errors on the reconstructed images. However, a main limitation of nearly all prior literature is the need to employ anomalous images to set a class-specific threshold to locate the anomalies. This limits their usability in realistic scenarios, where only normal data is typically accessible. Despite this major drawback, only a handful of works have addressed this limitation, by integrating supervision on attention maps during training. In this work, we propose a novel formulation that does not require accessing images with abnormalities to define the threshold. Furthermore, and in contrast to very recent work, the proposed constraint is formulated in a more principled manner, leveraging well-known knowledge in constrained optimization. In particular, the equality constraint on the attention maps in prior work is replaced by an inequality constraint, which allows more flexibility. In addition, to address the limitations of penalty-based functions, we employ an extension of the popular log-barrier methods to handle the constraint. Last, we propose an alternative regularization term that maximizes the Shannon entropy of the attention maps, reducing the number of hyperparameters of the proposed model. Comprehensive experiments on two publicly available datasets on brain lesion segmentation demonstrate that the proposed approach substantially outperforms relevant literature, establishing new state-of-the-art results for unsupervised lesion segmentation, and without the need to access anomalous images.

IV | Nov 30, 2022
Challenging mitosis detection algorithms: Global labels allow centroid localization

Claudio Fernandez-Martín, Umay Kiraz, Julio Silva-Rodríguez et al.

Mitotic activity is a crucial proliferation biomarker for the diagnosis and prognosis of different types of cancers. Nevertheless, mitosis counting is a cumbersome process for pathologists, prone to low reproducibility, due to the large size of augmented biopsy slides, the low density of mitotic cells, and pattern heterogeneity. To improve reproducibility, deep learning methods using convolutional neural networks have been proposed in recent years. However, these methods have been hindered by the process of data labelling, which usually consists solely of the mitosis centroids. Therefore, current literature proposes complex algorithms with multiple stages to refine the labels at pixel level, and to reduce the number of false positives. In this work, we propose to avoid complex scenarios, and we perform the localization task in a weakly supervised manner, using only image-level labels on patches. The results obtained on the publicly available TUPAC16 dataset are competitive with state-of-the-art methods, using only one training phase. Our method achieves an F1-score of 0.729 and challenges the efficiency of previous methods, which required multiple stages and strong mitosis location information.

HC | Jul 11, 2023
HistoColAi: An Open-Source Web Platform for Collaborative Digital Histology Image Annotation with AI-Driven Predictive Integration

Cristian Camilo Pulgarín-Ospina, Rocío del Amor, Adrián Colomera et al.

Digital pathology has become a standard in the pathology workflow due to its many benefits. These include the level of detail of the whole slide images generated and the potential immediate sharing of cases between hospitals. Recent advances in deep learning-based methods for image analysis make them of potential aid in digital pathology. However, a major limitation in developing computer-aided diagnostic systems for pathology is the lack of an intuitive and open web application for data annotation. This paper proposes a web service that efficiently provides a tool to visualize and annotate digitized histological images. In addition, to show and validate the tool, in this paper we include a use case centered on the diagnosis of spindle cell skin neoplasm for multiple annotators. A usability study of the tool is also presented, showing the feasibility of the developed tool.

CV | Jul 10, 2025 | Code
ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez et al.

Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.

CV | Jul 21, 2025 | Code
Regularized Low-Rank Adaptation for Few-Shot Organ Segmentation

Ghassen Baklouti, Julio Silva-Rodríguez, Jose Dolz et al.

Parameter-efficient fine-tuning (PEFT) of pre-trained foundation models is increasingly attracting interest in medical imaging due to its effectiveness and computational efficiency. Among these methods, Low-Rank Adaptation (LoRA) is a notable approach based on the assumption that the adaptation inherently occurs in a low-dimensional subspace. While it has shown good performance, its implementation requires a fixed and unalterable rank, which might be challenging to select given the unique complexities and requirements of each medical imaging downstream task. Inspired by advancements in natural image processing, we introduce a novel approach for medical image segmentation that dynamically adjusts the intrinsic rank during adaptation. Viewing the low-rank representation of the trainable weight matrices as a singular value decomposition, we introduce an l_1 sparsity regularizer to the loss function, and tackle it with a proximal optimizer. The regularizer could be viewed as a penalty on the decomposition rank. Hence, its minimization enables task-adapted ranks to be found automatically. Our method is evaluated in a realistic few-shot fine-tuning setting, where we compare it first to the standard LoRA and then to several other PEFT methods across two distinguishable tasks: base organs and novel organs. Our extensive experiments demonstrate the significant performance improvements driven by our method, highlighting its efficiency and robustness against suboptimal rank initialization. Our code is publicly available: https://github.com/ghassenbaklouti/ARENA
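
The l_1 regularizer and proximal optimizer mentioned above can be illustrated with a generic proximal-gradient step. This is a minimal sketch of standard soft-thresholding, not the authors' implementation: the function names, the step size `lr`, and the regularization weight `lam` are illustrative assumptions, and the vector `s` stands in for the singular-value-like scales of a low-rank update.

```python
import numpy as np

def soft_threshold(s, lam):
    """Proximal operator of lam * ||s||_1: shrink each entry toward zero by lam."""
    return np.sign(s) * np.maximum(np.abs(s) - lam, 0.0)

def proximal_step(s, grad, lr, lam):
    """One proximal-gradient update: gradient step on the loss, then l_1 shrinkage."""
    return soft_threshold(s - lr * grad, lr * lam)

# Toy usage (hypothetical values): entries below the threshold become exactly zero,
# which lowers the number of active rank components.
s = np.array([0.9, 0.05, -0.3, 0.01])
s_new = proximal_step(s, grad=np.zeros_like(s), lr=0.1, lam=1.0)
effective_rank = int(np.count_nonzero(s_new))
```

Because the shrinkage sets small entries exactly to zero rather than merely making them small, the number of non-zero entries can serve as a task-adapted rank.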

CV | Dec 20, 2023
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models

Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed et al.

Efficient transfer learning (ETL) is receiving increasing attention to adapt large pre-trained language-vision models on downstream tasks with a few labeled samples. While significant progress has been made, we reveal that state-of-the-art ETL approaches exhibit strong performance only in narrowly-defined experimental setups, and with a careful adjustment of hyperparameters based on a large corpus of labeled samples. In particular, we make two interesting and surprising empirical observations. First, to outperform a simple Linear Probing baseline, these methods require optimizing their hyper-parameters on each target task. And second, they typically underperform -- sometimes dramatically -- standard zero-shot predictions in the presence of distributional drifts. Motivated by the unrealistic assumptions made in the existing literature, i.e., access to a large validation set and case-specific grid-search for optimal hyperparameters, we propose a novel approach that meets the requirements of real-world scenarios. More concretely, we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing term is optimized via an adaptation of the general Augmented Lagrangian method tailored to this context. We comprehensively evaluate CLAP on a broad span of datasets and scenarios, demonstrating that it consistently outperforms SoTA approaches, while being a much more efficient alternative.

CV | Apr 20
Trustworthy Endoscopic Super-Resolution

Julio Silva-Rodríguez, Ender Konukoglu

Super-resolution (SR) models are attracting growing interest for enhancing minimally invasive surgery and diagnostic videos under hardware constraints. However, valid concerns remain regarding the introduction of hallucinated structures and amplified noise, limiting their reliability in safety-critical settings. We propose a direct and practical framework to make SR systems more trustworthy by identifying where reconstructions are likely to fail. Our approach integrates a lightweight error-prediction network that operates on intermediate representations to estimate pixel-wise reconstruction error. The module is computationally efficient and low-latency, making it suitable for real-time deployment. We convert these predictions into operational failure decisions by constructing Conformal Failure Masks (CFM), which localize regions where the SR output should not be trusted. Built on conformal risk control principles, our method provides theoretical guarantees for controlling both the tolerated error limit and the miscoverage in detected failures. We evaluate our approach on image and video SR, demonstrating its effectiveness in detecting unreliable reconstructions in endoscopic and robotic surgery settings. To our knowledge, this is the first study to provide a model-agnostic, theoretically grounded approach to improving the safety of real-time endoscopic image SR.

CV | Dec 8, 2024
Are foundation models for computer vision good conformal predictors?

Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed et al.

Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has been barely explored. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.

CV | May 30, 2025
Conformal Prediction for Zero-Shot Models

Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz

Vision-language models pre-trained at large scale have shown unprecedented adaptability and generalization to downstream tasks. Although their discriminative potential has been widely explored, their reliability and uncertainty are still overlooked. In this work, we investigate the capabilities of CLIP models under the split conformal prediction paradigm, which provides theoretical guarantees to black-box models based on a small, labeled calibration set. In contrast to the main body of literature on conformal predictors in vision classifiers, foundation models exhibit a particular characteristic: they are pre-trained on a one-time basis on an inaccessible source domain, different from the transferred task. This domain drift negatively affects the efficiency of the conformal sets and poses additional challenges. To alleviate this issue, we propose Conf-OT, a transfer learning setting that operates transductively over the combined calibration and query sets. Solving an optimal transport problem, the proposed method bridges the domain gap between pre-training and adaptation without requiring additional data splits but still maintaining coverage guarantees. We comprehensively explore this conformal prediction strategy on a broad span of 15 datasets and three non-conformity scores. Conf-OT provides consistent relative improvements of up to 20% on set efficiency while being 15 times faster than popular transductive approaches.
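
For readers unfamiliar with the split conformal prediction paradigm referred to above, the following is a minimal sketch of the standard procedure with a simple non-conformity score (one minus the probability of the true class). It does not implement Conf-OT; the function names and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Threshold from calibration scores, where score = 1 - probability of the true class."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank with finite-sample correction
    if k > n:
        return np.inf  # too few calibration points: every class must be kept
    return np.sort(scores)[k - 1]

def prediction_sets(test_probs, q_hat):
    """Boolean mask of the classes kept in each test sample's prediction set."""
    return (1.0 - test_probs) <= q_hat

# Toy usage with hypothetical softmax outputs (3 calibration samples, 3 classes).
cal_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
cal_labels = np.array([0, 1, 2])
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

test_probs = np.array([[0.5, 0.45, 0.05], [0.9, 0.05, 0.05]])
sets = prediction_sets(test_probs, q_hat)
```

Smaller sets at the same coverage level correspond to better set efficiency, the quantity the abstract reports improvements on.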

CV | Mar 3
Semi-Supervised Few-Shot Adaptation of Vision-Language Models

Julio Silva-Rodríguez, Ender Konukoglu

Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.

CV | Jun 20, 2025
Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction

Julio Silva-Rodríguez, Ismail Ben Ayed, Jose Dolz

Medical vision-language models (VLMs) have demonstrated unprecedented transfer capabilities and are being increasingly adopted for data-efficient image classification. Despite their growing popularity, their reliability remains largely unexplored. This work explores the split conformal prediction (SCP) framework to provide trustworthiness guarantees when transferring such models based on a small labeled calibration set. Despite its potential, the generalist nature of the VLMs' pre-training could negatively affect the properties of the predicted conformal sets for specific tasks. While common practice in transfer learning for discriminative purposes involves an adaptation stage, we observe that deploying such a solution for conformal purposes is suboptimal since adapting the model using the available calibration data breaks the rigid exchangeability assumptions for test data in SCP. To address this issue, we propose transductive split conformal adaptation (SCA-T), a novel pipeline for transfer learning on conformal scenarios, which performs an unsupervised transductive adaptation jointly on calibration and test data. We present comprehensive experiments utilizing medical VLMs across various image modalities, transfer tasks, and non-conformity scores. Our framework offers consistent gains in efficiency and conditional coverage compared to SCP, maintaining the same empirical guarantees.

CV | Jun 20, 2025
Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation

Julio Silva-Rodríguez, Fereshteh Shakeri, Houda Bahig et al.

Vision-language models (VLMs) are gaining attention in medical image analysis. These are pre-trained on large, heterogeneous data sources, yielding rich and transferable representations. Notably, the combination of modality-specialized VLMs with few-shot adaptation has provided fruitful results, enabling the efficient deployment of high-performing solutions. However, previous works on this topic make strong assumptions about the distribution of adaptation data, which are unrealistic in the medical domain. First, prior art assumes access to a balanced support set, a condition that breaks the natural imbalance in disease prevalence found in real-world scenarios. Second, these works typically assume the presence of an additional validation set to fix critical hyper-parameters, which is highly data-inefficient. This work challenges these favorable deployment scenarios and introduces a realistic, imbalanced, validation-free adaptation setting. Our extensive benchmark across various modalities and downstream tasks demonstrates that current methods systematically compromise their performance when operating under realistic conditions, occasionally even performing worse than zero-shot inference. Also, we introduce a training-free linear probe that adaptively blends visual and textual supervision. Detailed studies demonstrate that the proposed solver is a strong, efficient baseline, enabling robust adaptation in challenging scenarios.

CV | Apr 7, 2025
A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?

Julio Silva-Rodríguez, Jose Dolz, Ismail Ben Ayed

Vision-language pre-training has recently gained popularity as it allows learning rich feature representations using large-scale data sources. This paradigm has quickly made its way into the medical image analysis community. In particular, there is an impressive amount of recent literature developing vision-language models for radiology. However, the available medical datasets with image-text supervision are scarce, and medical concepts are fine-grained, involving expert knowledge that existing vision-language models struggle to encode. In this paper, we propose to take a prudent step back from the literature and revisit supervised, unimodal pre-training, using fine-grained labels instead. We conduct an extensive comparison demonstrating that unimodal pre-training is highly competitive and better suited to integrating heterogeneous data sources. Our results also question the potential of recent vision-language models for open-vocabulary generalization, which have been evaluated using optimistic experimental settings. Finally, we study novel alternatives to better integrate fine-grained labels and noisy text supervision.

CV | May 24, 2024
Self-Contrastive Weakly Supervised Learning Framework for Prognostic Prediction Using Whole Slide Images

Saul Fuster, Farbod Khoraminia, Julio Silva-Rodríguez et al.

We present a pioneering investigation into the application of deep learning techniques to analyze histopathological images for addressing the substantial challenge of automated prognostic prediction. Prognostic prediction poses a unique challenge as the ground truth labels are inherently weak, and the model must anticipate future events that are not directly observable in the image. To address this challenge, we propose a novel three-part framework comprising a convolutional-network-based tissue segmentation algorithm for region of interest delineation, a contrastive learning module for feature extraction, and a nested multiple instance learning classification module. Our study explores the significance of various regions of interest within the histopathological slides and exploits diverse learning scenarios. The pipeline is initially validated on artificially generated data and a simpler diagnostic task. Transitioning to prognostic prediction, tasks become more challenging. Employing bladder cancer as a use case, our best models yield an AUC of 0.721 and 0.678 for recurrence and treatment outcome prediction, respectively.

CV | Jan 11, 2024
Attention to detail: inter-resolution knowledge distillation

Rocío del Amor, Julio Silva-Rodríguez, Adrián Colomer et al.

The development of computer vision solutions for gigapixel images in digital pathology is hampered by significant computational limitations due to the large size of whole slide images. In particular, digitizing biopsies at high resolutions is a time-consuming process, which is necessary due to the worsening results from the decrease in image detail. To alleviate this issue, recent literature has proposed using knowledge distillation to enhance the model performance at reduced image resolutions. In particular, soft labels and features extracted at the highest magnification level are distilled into a model that takes lower-magnification images as input. However, this approach fails to transfer knowledge about the most discriminative image regions in the classification process, which may be lost when the resolution is decreased. In this work, we propose to distill this information by incorporating attention maps during training. In particular, our formulation leverages saliency maps of the target class via grad-CAMs, which guides the lower-resolution Student model to match the Teacher distribution by minimizing the l2 distance between them. Comprehensive experiments on prostate histology image grading demonstrate that the proposed approach substantially improves the model performance across different image resolutions compared to previous literature.

IV | Sep 1, 2021
Looking at the whole picture: constrained unsupervised anomaly segmentation

Julio Silva-Rodríguez, Valery Naranjo, Jose Dolz

Current unsupervised anomaly localization approaches rely on generative models to learn the distribution of normal images, which is later used to identify potential anomalous regions derived from errors on the reconstructed images. However, a main limitation of nearly all prior literature is the need to employ anomalous images to set a class-specific threshold to locate the anomalies. This limits their usability in realistic scenarios, where only normal data is typically accessible. Despite this major drawback, only a handful of works have addressed this limitation, by integrating supervision on attention maps during training. In this work, we propose a novel formulation that does not require accessing images with abnormalities to define the threshold. Furthermore, and in contrast to very recent work, the proposed constraint is formulated in a more principled manner, leveraging well-known knowledge in constrained optimization. In particular, the equality constraint on the attention maps in prior work is replaced by an inequality constraint, which allows more flexibility. In addition, to address the limitations of penalty-based functions, we employ an extension of the popular log-barrier methods to handle the constraint. Comprehensive experiments on the popular BRATS'19 dataset demonstrate that the proposed approach substantially outperforms relevant literature, establishing new state-of-the-art results for unsupervised lesion segmentation.

IV | May 21, 2021
Prostate Gland Segmentation in Histology Images via Residual and Multi-Resolution U-Net

Julio Silva-Rodríguez, Elena Payá-Bosch, Gabriel García et al.

Prostate cancer is one of the most prevalent cancers worldwide. One of the key factors in reducing its mortality is early detection. The computer-aided diagnosis systems for this task are based on the glandular structural analysis in histology images. Hence, accurate gland detection and segmentation is crucial for a successful prediction. The methodological basis of this work is a prostate gland segmentation based on U-Net convolutional neural network architectures modified with residual and multi-resolution blocks, trained using data augmentation techniques. On the test subset, the residual configuration outperforms the previous state-of-the-art approaches in an image-level comparison, reaching an average Dice Index of 0.77.
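
The Dice Index reported above measures the overlap between a predicted and a reference mask. A minimal sketch of the standard definition for binary masks follows; the function name, the smoothing term `eps`, and the toy masks are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def dice_index(pred, target, eps=1e-8):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty (score is then 1.0)
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy usage with hypothetical 2x3 masks: 2 overlapping pixels, 3 foreground pixels each.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
score = dice_index(pred, target)
```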

IV | May 21, 2021
Going Deeper through the Gleason Scoring Scale: An Automatic end-to-end System for Histology Prostate Grading and Cribriform Pattern Detection

Julio Silva-Rodríguez, Adrián Colomer, María A. Sales et al.

The Gleason scoring system is the primary diagnostic and prognostic tool for prostate cancer. In recent years, with the development of digitisation devices, the use of computer vision techniques for the analysis of biopsies has increased. However, to the best of the authors' knowledge, the development of algorithms to automatically detect individual cribriform patterns belonging to Gleason grade 4 has not yet been studied in the literature. The objective of the work presented in this paper is to develop a deep-learning-based system able to support pathologists in the daily analysis of prostate biopsies. The methodological core of this work is a patch-wise predictive model based on convolutional neural networks able to determine the presence of cancerous patterns. In particular, we train from scratch a simple self-designed architecture. The cribriform pattern is detected by retraining the set of filters of the last convolutional layer in the network. From the reconstructed prediction map, we compute the percentage of each Gleason grade in the tissue to feed a multi-layer perceptron which provides a biopsy-level score. In our SICAPv2 database, composed of 182 annotated whole slide images, we obtained a Cohen's quadratic kappa of 0.77 in the test set for the patch-level Gleason grading with the proposed architecture trained from scratch. Our results outperform previous ones reported in the literature. Furthermore, this model reaches the level of fine-tuned state-of-the-art architectures in a patient-based four groups cross validation. In the cribriform pattern detection task, we obtained an area under ROC curve of 0.82. Regarding the biopsy Gleason scoring, we achieved a quadratic Cohen's Kappa of 0.81 in the test subset. Shallow CNN architectures trained from scratch outperform current state-of-the-art methods for Gleason grades classification.
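
Cohen's quadratic kappa, the agreement metric reported above, penalizes disagreements by the squared distance between ordinal grades. The following is a minimal sketch of the standard definition; the function name and the toy labels are illustrative assumptions, and it assumes the labels span at least two distinct classes so the denominator is non-zero.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels 0..n_classes-1."""
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected confusion matrix under independence of the two raters' marginals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Toy usage with hypothetical grades: perfect agreement versus chance-level agreement.
kappa_perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], n_classes=4)
kappa_chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], n_classes=2)
```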

IVMay 21, 2021
WeGleNet: A Weakly-Supervised Convolutional Neural Network for the Semantic Segmentation of Gleason Grades in Prostate Histology Images

Julio Silva-Rodríguez, Adrián Colomer, Valery Naranjo

Prostate cancer is one of the main diseases affecting men worldwide. The Gleason scoring system is the primary diagnostic tool for prostate cancer. The score is obtained via the visual analysis of cancerous patterns in prostate biopsies performed by expert pathologists, and the aggregation of the main Gleason grades in a combined score. Computer-aided diagnosis systems make it possible to reduce the workload of pathologists and increase objectivity. Recently, efforts have been made in the literature to develop algorithms aiming at the direct estimation of the global Gleason score at biopsy/core level with global labels. However, these algorithms do not cover the accurate localization of the Gleason patterns in the tissue. In this work, we propose a deep-learning-based system able to detect local cancerous patterns in the prostate tissue using only the global-level Gleason score during training. The methodological core of this work is the proposed weakly-supervised-trained convolutional neural network, WeGleNet, based on a multi-class segmentation layer after the feature extraction module, a global-aggregation, and the slicing of the background class for the model loss estimation during training. We obtained a Cohen's quadratic kappa (k) of 0.67 for the pixel-level prediction of cancerous patterns in the validation cohort. We compared the model performance for semantic segmentation of Gleason grades with supervised state-of-the-art architectures in the test cohort. We obtained a pixel-level k of 0.61 and a macro-averaged F1-score of 0.58, at the same level as fully-supervised methods. Regarding the estimation of the core-level Gleason score, we obtained a k of 0.76 and 0.67 between the model and two different pathologists. WeGleNet is capable of performing the semantic segmentation of Gleason grades similarly to fully-supervised methods without requiring pixel-level annotations.
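
The three ingredients named in the abstract above (a per-pixel multi-class output, a global aggregation, and the removal of the background class before the loss) can be sketched in plain Python. This is not the authors' implementation: the choice of max pooling as the aggregation, the binary cross-entropy loss, the class order with background at index 0, and all function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def image_level_scores(pixel_logits):
    """Aggregate per-pixel class probabilities into image-level scores.

    pixel_logits: list of pixels, each a list of class logits with the
    background class at index 0 (assumed ordering).
    """
    probs = [softmax(pixel) for pixel in pixel_logits]
    n_classes = len(probs[0])
    # Global aggregation, assumed here to be max pooling over all pixels.
    pooled = [max(pixel[c] for pixel in probs) for c in range(n_classes)]
    # Slice off the background class so the loss only sees the grades.
    return pooled[1:]


def weak_loss(pixel_logits, image_labels):
    """Mean binary cross-entropy against image-level labels (assumed loss)."""
    eps = 1e-7
    scores = image_level_scores(pixel_logits)
    loss = 0.0
    for score, label in zip(scores, image_labels):
        score = min(max(score, eps), 1.0 - eps)
        loss -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return loss / len(scores)


# Three pixels, four classes: background, G3, G4, G5 (hypothetical order).
pixels = [
    [4.0, 0.0, 0.0, 0.0],  # confidently background
    [0.0, 4.0, 0.0, 0.0],  # confidently G3
    [0.0, 0.0, 4.0, 0.0],  # confidently G4
]
scores = image_level_scores(pixels)
print(scores)
print(weak_loss(pixels, [1, 1, 0]), weak_loss(pixels, [0, 0, 1]))
```

Because the background channel takes part in the per-pixel softmax but not in the loss, pixels are free to be explained as background, while only the image-level grade labels supervise the remaining channels.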

IVMay 21, 2021
Self-learning for weakly supervised Gleason grading of local patterns

Julio Silva-Rodríguez, Adrián Colomer, Jose Dolz et al.

Prostate cancer is one of the main diseases affecting men worldwide. The gold standard for diagnosis and prognosis is the Gleason grading system. In this process, pathologists manually analyze prostate histology slides under the microscope, a highly time-consuming and subjective task. In recent years, computer-aided-diagnosis (CAD) systems have emerged as a promising tool that could support pathologists in daily clinical practice. Nevertheless, these systems are usually trained using tedious and error-prone pixel-level annotations of Gleason grades in the tissue. Only a handful of works in the literature have sought to alleviate the need for manual pixel-wise labeling. Motivated by this, we propose a novel weakly-supervised deep-learning model, based on self-learning CNNs, that leverages only the global Gleason score of gigapixel whole slide images during training to accurately perform both the grading of patch-level patterns and biopsy-level scoring. To evaluate the performance of the proposed method, we perform extensive experiments on three different external datasets for the patch-level Gleason grading, and on two different test sets for global Grade Group prediction. We empirically demonstrate that our approach outperforms its supervised counterpart on patch-level Gleason grading by a large margin, as well as state-of-the-art methods on global biopsy-level scoring. In particular, the proposed model brings an average improvement on the Cohen's quadratic kappa (k) score of nearly 18% compared to full-supervision for the patch-level Gleason grading task.

CVMay 22, 2020
Gleason Grading of Histology Prostate Images through Semantic Segmentation via Residual U-Net

Amartya Kalapahar, Julio Silva-Rodríguez, Adrián Colomer et al.

Worldwide, prostate cancer is one of the main cancers affecting men. The final diagnosis of prostate cancer is based on the visual detection of Gleason patterns in prostate biopsies by pathologists. Computer-aided-diagnosis systems make it possible to delineate and classify the cancerous patterns in the tissue via computer-vision algorithms in order to support the physicians' task. The methodological core of this work is a U-Net convolutional neural network for image segmentation modified with residual blocks able to segment cancerous tissue according to the full Gleason system. This model outperforms other well-known architectures, and reaches a pixel-level Cohen's quadratic Kappa of 0.52, at the level of previous image-level works in the literature, while also providing a detailed localisation of the patterns.
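
Cohen's quadratic kappa, the agreement metric reported in the abstracts above, can be computed from two label sequences as in the following sketch. The function name, the integer encoding of the classes as 0 to `n_classes - 1`, and the handling of the degenerate case are assumptions made for illustration.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Cohen's kappa with quadratic weights for ordinal integer labels.

    rater_a and rater_b are equal-length sequences of labels in
    range(n_classes), for example predictions and reference annotations.
    """
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [
        sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic penalty grows with the distance between grades.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Expected count if the two raters were independent.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0.0:
        # Degenerate case: both raters used one identical class throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


reference = [0, 1, 2, 3]
print(quadratic_weighted_kappa(reference, [0, 1, 2, 3], n_classes=4))
print(quadratic_weighted_kappa(reference, [3, 2, 1, 0], n_classes=4))
```

The quadratic weights penalise confusing distant grades more than adjacent ones, which is why the metric suits ordinal scales such as Gleason grading.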