CVAug 30, 2022
TCAM: Temporal Class Activation Maps for Object Localization in Weakly-Labeled Unconstrained VideosSoufiane Belharbi, Ismail Ben Ayed, Luke McCaffrey et al.
Weakly supervised video object localization (WSVOL) allows locating object in videos using only global video tags such as object class. State-of-art methods rely on multiple independent stages, where initial spatio-temporal proposals are generated using visual and motion cues, then prominent objects are identified and refined. Localization is done by solving an optimization problem over one or more videos, and video tags are typically used for video clustering. This requires a model per-video or per-class making for costly inference. Moreover, localized regions are not necessary discriminant because of unsupervised motion methods like optical flow, or because video tags are discarded from optimization. In this paper, we leverage the successful class activation mapping (CAM) methods, designed for WSOL based on still images. A new Temporal CAM (TCAM) method is introduced to train a discriminant deep learning (DL) model to exploit spatio-temporal information in videos, using an aggregation mechanism, called CAM-Temporal Max Pooling (CAM-TMP), over consecutive CAMs. In particular, activations of regions of interest (ROIs) are collected from CAMs produced by a pretrained CNN classifier to build pixel-wise pseudo-labels for training the DL model. In addition, a global unsupervised size constraint, and local constraint such as CRF are used to yield more accurate CAMs. Inference over single independent frames allows parallel processing of a clip of frames, and real-time localization. Extensive experiments on two challenging YouTube-Objects datasets for unconstrained videos, indicate that CAM methods (trained on independent frames) can yield decent localization accuracy. Our proposed TCAM method achieves a new state-of-art in WSVOL accuracy, and visual results suggest that it can be adapted for subsequent tasks like visual object tracking and detection. Code is publicly available.
CVMar 16, 2023
CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained VideosSoufiane Belharbi, Shakeeb Murtaza, Marco Pedersoli et al.
Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods only rely on visual and motion cues, while discarding discriminative information, making them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although their results are promising, objects are assumed to have limited movement from frame to frame, leading to degradation in performance for relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on Co-Localization, hence, the name CoLo-CAM. Given a sequence of frames, localization is jointly learned based on color cues extracted across the corresponding maps, by assuming that an object has similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, achieving co-localization. This improves localization performance because the joint learning creates direct communication among pixels across all image locations and over all frames, allowing for transfer, aggregation, and correction of localizations. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our method, and its robustness to long-term dependencies, leading to new state-of-the-art performance for WSVOL task.
IVMay 12, 2022
Leveraging Uncertainty for Deep Interpretable Classification and Weakly-Supervised Segmentation of Histology ImagesSoufiane Belharbi, Jérôme Rony, Jose Dolz et al.
Trained using only image class label, deep weakly supervised methods allow image classification and ROI segmentation for interpretability. Despite their success on natural images, they face several challenges over histology data where ROI are visually similar to background making models vulnerable to high pixel-wise false positives. These methods lack mechanisms for modeling explicitly non-discriminative regions which raises false-positive rates. We propose novel regularization terms, which enable the model to seek both non-discriminative and discriminative regions, while discouraging unbalanced segmentations and using only image class label. Our method is composed of two networks: a localizer that yields segmentation mask, followed by a classifier. The training loss pushes the localizer to build a segmentation mask that holds most discrimiantive regions while simultaneously modeling background regions. Comprehensive experiments over two histology datasets showed the merits of our method in reducing false positives and accurately segmenting ROI.
55.6LGMar 16
Longitudinal Risk Prediction in Mammography with Privileged History DistillationBanafsheh Karimian, Alexis Guichemerre, Soufiane Belharbi et al.
Breast cancer remains a leading cause of cancer-related mortality worldwide. Longitudinal mammography risk prediction models improve multi-year breast cancer risk prediction based on prior screening exams. However, in real-world clinical practice, longitudinal histories are often incomplete, irregular, or unavailable due to missed screenings, first-time examinations, heterogeneous acquisition schedules, or archival constraints. The absence of prior exams degrades the performance of longitudinal risk models and limits their practical applicability. While substantial longitudinal history is available during training, prior exams are commonly absent at test time. In this paper, we address missing history at inference time and propose a longitudinal risk prediction method that uses mammography history as privileged information during training and distills its prognostic value into a student model that only requires the current exam at inference time. The key idea is a privileged multi-teacher distillation scheme with horizon-specific teachers: each teacher is trained on the full longitudinal history to specialize in one prediction horizon, while the student receives only a reconstructed history derived from the current exam. This allows the student to inherit horizon-dependent longitudinal risk cues without requiring prior screening exams at deployment. Our new Privileged History Distillation (PHD) method is validated on a large longitudinal mammography dataset with multi-year cancer outcomes, CSAW-CC, comparing full-history and no-history baselines to their distilled counterparts. Using time-dependent AUC across horizons, our privileged history distillation method markedly improves the performance of long-horizon prediction over no-history models and is comparable to that of full-history models, while using only the current exam at inference time.
IVJun 13, 2024Code
SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-ResolutionSoufiane Belharbi, Mara KM Whitford, Phuong Hoang et al.
Confocal fluorescence microscopy is one of the most accessible and widely used imaging techniques for the study of biological processes at the cellular and subcellular levels. Scanning confocal microscopy allows the capture of high-quality images from thick three-dimensional (3D) samples, yet suffers from well-known limitations such as photobleaching and phototoxicity of specimens caused by intense light exposure, limiting its applications. Cellular damage can be alleviated by changing imaging parameters to reduce light exposure, often at the expense of image quality. Machine/deep learning methods for single-image super-resolution (SISR) can be applied to restore image quality by upscaling lower-resolution (LR) images to yield high-resolution images (HR). These SISR methods have been successfully applied to photo-realistic images due partly to the abundance of publicly available data. In contrast, the lack of publicly available data partly limits their application and success in scanning confocal microscopy. In this paper, we introduce a large scanning confocal microscopy dataset named SR-CACO-2 that is comprised of low- and high-resolution image pairs marked for three different fluorescent markers. It allows the evaluation of performance of SISR methods on three different upscaling levels (X2, X4, X8). SR-CACO-2 contains the human epithelial cell line Caco-2 (ATCC HTB-37), and it is composed of 2,200 unique images, captured with four resolutions and three markers, forming 9,937 image patches for SISR methods. We provide benchmarking results for 16 state-of-the-art methods of the main SISR families. Results show that these methods have limited success in producing high-resolution textures. The dataset is freely accessible under a Creative Commons license (CC BY-NC-SA 4.0). Our dataset, code and pretrained weights for SISR methods are available: https://github.com/sbelharbi/sr-caco-2.
LGNov 25, 2019Code
Non-parametric Uni-modality Constraints for Deep Ordinal ClassificationSoufiane Belharbi, Ismail Ben Ayed, Luke McCaffrey et al.
We propose a new constrained-optimization formulation for deep ordinal classification, in which uni-modality of the label distribution is enforced implicitly via a set of inequality constraints over all the pairs of adjacent labels. Based on (c-1) constraints for c labels, our model is non-parametric and, therefore, more flexible than the existing deep ordinal classification techniques. Unlike these, it does not restrict the learned representation to a single and specific parametric model (or penalty) imposed on all the labels. Therefore, it enables the training to explore larger spaces of solutions, while removing the need for ad hoc choices and scaling up to large numbers of labels. It can be used in conjunction with any standard classification loss and any deep architecture. To tackle the ensuing challenging optimization problem, we solve a sequence of unconstrained losses based on a powerful extension of the log-barrier method. This handles effectively competing constraints and accommodates standard SGD for deep networks, while avoiding computationally expensive Lagrangian dual steps and outperforming substantially penalty methods. Furthermore, we propose a new performance metric for ordinal classification, as a proxy to measure distribution uni-modality, referred to as the Sides Order Index (SOI). We report comprehensive evaluations and comparisons to state-of-the-art methods on benchmark public datasets for several ordinal classification tasks, showing the merits of our approach in terms of label consistency, classification accuracy and scalability. Importantly, enforcing label consistency with our model does not incur higher classification errors, unlike many existing ordinal classification methods. A public reproducible PyTorch implementation is provided. (https://github.com/sbelharbi/unimodal-prob-deep-oc-free-distribution)
CVJul 25, 2019Code
Min-max Entropy for Weakly Supervised Pointwise LocalizationSoufiane Belharbi, Jérôme Rony, Jose Dolz et al.
Pointwise localization allows more precise localization and accurate interpretability, compared to bounding box, in applications where objects are highly unstructured such as in medical domain. In this work, we focus on weakly supervised localization (WSL) where a model is trained to classify an image and localize regions of interest at pixel-level using only global image annotation. Typical convolutional attentions maps are prune to high false positive regions. To alleviate this issue, we propose a new deep learning method for WSL, composed of a localizer and a classifier, where the localizer is constrained to determine relevant and irrelevant regions using conditional entropy (CE) with the aim to reduce false positive regions. Experimental results on a public medical dataset and two natural datasets, using Dice index, show that, compared to state of the art WSL methods, our proposal can provide significant improvements in terms of image-level classification and pixel-level localization (low false positive) with robustness to overfitting. A public reproducible PyTorch implementation is provided in: https://github.com/sbelharbi/wsol-min-max-entropy-interpretability .
CVApr 29, 2024
Source-Free Domain Adaptation of Weakly-Supervised Object Localization Models for HistologyAlexis Guichemerre, Soufiane Belharbi, Tsiry Mayet et al.
Given the emergence of deep learning, digital pathology has gained popularity for cancer diagnosis based on histology images. Deep weakly supervised object localization (WSOL) models can be trained to classify histology images according to cancer grade and identify regions of interest (ROIs) for interpretation, using inexpensive global image-class annotations. A WSOL model initially trained on some labeled source image data can be adapted using unlabeled target data in cases of significant domain shifts caused by variations in staining, scanners, and cancer type. In this paper, we focus on source-free (unsupervised) domain adaptation (SFDA), a challenging problem where a pre-trained source model is adapted to a new target domain without using any source domain data for privacy and efficiency reasons. SFDA of WSOL models raises several challenges in histology, most notably because they are not intended to adapt for both classification and localization tasks. In this paper, 4 state-of-the-art SFDA methods, each one representative of a main SFDA family, are compared for WSOL in terms of classification and localization accuracy. They are the SFDA-Distribution Estimation, Source HypOthesis Transfer, Cross-Domain Contrastive Learning, and Adaptively Domain Statistics Alignment. Experimental results on the challenging Glas (smaller, breast cancer) and Camelyon16 (larger, colon cancer) histology datasets indicate that these SFDA methods typically perform poorly for localization after adaptation when optimized for classification.
CVMar 31, 2025
PixelCAM: Pixel Class Activation Mapping for Histology Image Classification and ROI LocalizationAlexis Guichemerre, Soufiane Belharbi, Mohammadhadi Shateri et al.
Weakly supervised object localization (WSOL) methods allow training models to classify images and localize ROIs. WSOL only requires low-cost image-class annotations yet provides a visually interpretable classifier. Standard WSOL methods rely on class activation mapping (CAM) methods to produce spatial localization maps according to a single- or two-step strategy. While both strategies have made significant progress, they still face several limitations with histology images. Single-step methods can easily result in under- or over-activation due to the limited visual ROI saliency in histology images and scarce localization cues. They also face the well-known issue of asynchronous convergence between classification and localization tasks. The two-step approach is sub-optimal because it is constrained to a frozen classifier, limiting the capacity for localization. Moreover, these methods also struggle when applied to out-of-distribution (OOD) datasets. In this paper, a multi-task approach for WSOL is introduced for simultaneous training of both tasks to address the asynchronous convergence problem. In particular, localization is performed in the pixel-feature space of an image encoder that is shared with classification. This allows learning discriminant features and accurate delineation of foreground/background regions to support ROI localization and image classification. We propose PixelCAM, a cost-effective foreground/background pixel-wise classifier in the pixel-feature space that allows for spatial object localization. Using partial-cross entropy, PixelCAM is trained using pixel pseudo-labels collected from a pretrained WSOL model. Both image and pixel-wise classifiers are trained simultaneously using standard gradient descent. In addition, our pixel classifier can easily be integrated into CNN- and transformer-based architectures without any modifications.
25.9CVMar 12
Adaptation of Weakly Supervised Localization in Histopathology by Debiasing PredictionsAlexis Guichemerre, Banafsheh Karimian, Soufiane Belharbi et al.
Weakly Supervised Object Localization (WSOL) models enable joint classification and region-of-interest localization in histology images using only image-class supervision. When deployed in a target domain, distributions shift remains a major cause of performance degradation, especially when applied on new organs or institutions with different staining protocols and scanner characteristics. Under stronger cross-domain shifts, WSOL predictions can become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Source-Free (Unsupervised) Domain Adaptation (SFDA) methods are commonly employed to address domain shift. However, because they rely on self-training, the initial bias is reinforced over training iterations, degrading both classification and localization tasks. We identify this amplification of prediction bias as a primary obstacle to the SFDA of WSOL models in histopathology. This paper introduces \sfdadep, a method inspired by machine unlearning that formulates SFDA as an iterative process of identifying and correcting prediction bias. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence for uncertain (high entropy) images, while preserving confident predictions. This process reduces the drift of decision boundaries and bias toward dominant classes. A jointly optimized pixel-level classifier further restores discriminative localization features under distribution shift. Extensive experiments on cross-organ and -center histopathology benchmarks (glas, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP consistently improves classification and localization over state-of-the-art SFDA baselines. {\small Code: \href{https://anonymous.4open.science/r/SFDA-DeP-1797/}{anonymous.4open.science/r/SFDA-DeP-1797/}}
CVApr 22, 2025
CLIP-IT: CLIP-based Pairing for Histology Images ClassificationBanafsheh Karimian, Giulia Avanzato, Soufian Belharbi et al.
Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, available free unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, keeping overhead low while still benefiting from multimodal training without requiring paired data in the downstream dataset. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of per-dataset paired annotation or inference-time complexity.
IVJan 7, 2022
Negative Evidence Matters in Interpretable Histology Image ClassificationSoufiane Belharbi, Marco Pedersoli, Ismail Ben Ayed et al.
Using only global image-class labels, weakly-supervised learning methods, such as class activation mapping, allow training CNNs to jointly classify an image, and locate regions of interest associated with the predicted class. However, without any guidance at the pixel level, such methods may yield inaccurate regions. This problem is known to be more challenging with histology images than with natural ones, since objects are less salient, structures have more variations, and foreground and background regions have stronger similarities. Therefore, computer vision methods for visual interpretation of CNNs may not directly apply. In this paper, a simple yet efficient method based on a composite loss is proposed to learn information from the fully negative samples (i.e., samples without positive regions), and thereby reduce false positives/negatives. Our new loss function contains two complementary terms: the first exploits positive evidence collected from the CNN classifier, while the second leverages the fully negative samples from training data. In particular, a pre-trained CNN is equipped with a decoder that allows refining the regions of interest. The CNN is exploited to collect both positive and negative evidence at the pixel level to train the decoder. Our method called NEGEV benefits from the fully negative samples that naturally occur in the data, without any additional supervision signals beyond image-class labels. Extensive experiments show that our proposed method can substantial outperform related state-of-art methods on GlaS (public benchmark for colon cancer), and Camelyon16 (patch-based benchmark for breast cancer using three different backbones). Our results highlight the benefits of using both positive and negative evidence, the first obtained from a classifier, and the other naturally available in datasets.
CVSep 15, 2021
F-CAM: Full Resolution Class Activation Maps via Guided Parametric UpscalingSoufiane Belharbi, Aydin Sarraf, Marco Pedersoli et al.
Class Activation Mapping (CAM) methods have recently gained much attention for weakly-supervised object localization (WSOL) tasks. They allow for CNN visualization and interpretation without training on fully annotated image datasets. CAM methods are typically integrated within off-the-shelf CNN backbones, such as ResNet50. Due to convolution and pooling operations, these backbones yield low resolution CAMs with a down-scaling factor of up to 32, contributing to inaccurate localizations. Interpolation is required to restore full size CAMs, yet it does not consider the statistical properties of objects, such as color and texture, leading to activations with inconsistent boundaries, and inaccurate localizations. As an alternative, we introduce a generic method for parametric upscaling of CAMs that allows constructing accurate full resolution CAMs (F-CAMs). In particular, we propose a trainable decoding architecture that can be connected to any CNN classifier to produce highly accurate CAM localizations. Given an original low resolution CAM, foreground and background pixels are randomly sampled to fine-tune the decoder. Additional priors such as image statistics and size constraints are also considered to expand and refine object boundaries. Extensive experiments, over three CNN backbones and six WSOL baselines on the CUB-200-2011 and OpenImages datasets, indicate that our F-CAM method yields a significant improvement in CAM localization accuracy. F-CAM performance is competitive with state-of-art WSOL methods, yet it requires fewer computations during inference.
CVNov 14, 2020
Deep Interpretable Classification and Weakly-Supervised Segmentation of Histology Images via Max-Min UncertaintySoufiane Belharbi, Jérôme Rony, Jose Dolz et al.
Weakly-supervised learning (WSL) has recently triggered substantial interest as it mitigates the lack of pixel-wise annotations. Given global image labels, WSL methods yield pixel-level predictions (segmentations), which enable to interpret class predictions. Despite their recent success, mostly with natural images, such methods can face important challenges when the foreground and background regions have similar visual cues, yielding high false-positive rates in segmentations, as is the case in challenging histology images. WSL training is commonly driven by standard classification losses, which implicitly maximize model confidence, and locate the discriminative regions linked to classification decisions. Therefore, they lack mechanisms for modeling explicitly non-discriminative regions and reducing false-positive rates. We propose novel regularization terms, which enable the model to seek both non-discriminative and discriminative regions, while discouraging unbalanced segmentations. We introduce high uncertainty as a criterion to localize non-discriminative regions that do not affect classifier decision, and describe it with original Kullback-Leibler (KL) divergence losses evaluating the deviation of posterior predictions from the uniform distribution. Our KL terms encourage high uncertainty of the model when the latter inputs the latent non-discriminative regions. Our loss integrates: (i) a cross-entropy seeking a foreground, where model confidence about class prediction is high; (ii) a KL regularizer seeking a background, where model uncertainty is high; and (iii) log-barrier terms discouraging unbalanced segmentations. Comprehensive experiments and ablation studies over the public GlaS colon cancer data and a Camelyon16 patch-based benchmark for breast cancer show substantial improvements over state-of-the-art WSL methods, and confirm the effect of our new regularizers.
CVOct 10, 2020
Deep Active Learning for Joint Classification & Segmentation with Weak AnnotatorSoufiane Belharbi, Ismail Ben Ayed, Luke McCaffrey et al.
CNN visualization and interpretation methods, like class-activation maps (CAMs), are typically used to highlight the image regions linked to class predictions. These models allow to simultaneously classify images and extract class-dependent saliency maps, without the need for costly pixel-level annotations. However, they typically yield segmentations with high false-positive rates and, therefore, coarse visualisations, more so when processing challenging images, as encountered in histology. To mitigate this issue, we propose an active learning (AL) framework, which progressively integrates pixel-level annotations during training. Given training data with global image-level labels, our deep weakly-supervised learning model jointly performs supervised image-level classification and active learning for segmentation, integrating pixel annotations by an oracle. Unlike standard AL methods that focus on sample selection, we also leverage large numbers of unlabeled images via pseudo-segmentations (i.e., self-learning at the pixel level), and integrate them with the oracle-annotated samples during training. We report extensive experiments over two challenging benchmarks -- high-resolution medical images (histology GlaS data for colon cancer) and natural images (CUB-200-2011 for bird species). Our results indicate that, by simply using random sample selection, the proposed approach can significantly outperform state-of the-art CAMs and AL methods, with an identical oracle-supervision budget. Our code is publicly available.
CVSep 8, 2019
Deep Weakly-Supervised Learning Methods for Classification and Localization in Histology Images: A SurveyJérôme Rony, Soufiane Belharbi, Jose Dolz et al.
Using deep learning models to diagnose cancer from histology data presents several challenges. Cancer grading and localization of regions of interest (ROIs) in these images normally relies on both image- and pixel-level labels, the latter requiring a costly annotation process. Deep weakly-supervised object localization (WSOL) methods provide different strategies for low-cost training of deep learning models. Using only image-class annotations, these methods can be trained to classify an image, and yield class activation maps (CAMs) for ROI localization. This paper provides a review of state-of-art DL methods for WSOL. We propose a taxonomy where these methods are divided into bottom-up and top-down methods according to the information flow in models. Although the latter have seen limited progress, recent bottom-up methods are currently driving much progress with deep WSOL methods. Early works focused on designing different spatial pooling functions. However, these methods reached limited localization accuracy, and unveiled a major limitation -- the under-activation of CAMs which leads to high false negative localization. Subsequent works aimed to alleviate this issue and recover complete object. Representative methods from our taxonomy are evaluated and compared in terms of classification and localization accuracy on two challenging histology datasets. Overall, the results indicate poor localization performance, particularly for generic methods that were initially designed to process natural images. Methods designed to address the challenges of histology data yielded good results. However, all methods suffer from high false positive/negative localization. Four key challenges are identified for the application of deep WSOL methods in histology -- under/over activation of CAMs, sensitivity to thresholding, and model selection.