Gustavo Carneiro

CV
h-index: 55
137 papers
8,359 citations
Novelty: 52%
AI Score: 62

137 Papers

CV | Nov 26, 2022 | Code
Residual Pattern Learning for Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

Yuyuan Liu, Choubo Ding, Yu Tian et al.

Semantic segmentation models classify pixels into a set of known ("in-distribution") visual classes. When deployed in an open world, the reliability of these models depends on their ability not only to classify in-distribution pixels but also to detect out-of-distribution (OoD) pixels. Historically, the poor OoD detection performance of these models has motivated the design of methods based on model re-training using synthetic training images that include OoD visual objects. Although successful, these re-trained methods have two issues: 1) their in-distribution segmentation accuracy may drop during re-training, and 2) their OoD detection accuracy does not generalise well to new contexts (e.g., country surroundings) outside the training set (e.g., city surroundings). In this paper, we mitigate these issues with: (i) a new residual pattern learning (RPL) module that assists the segmentation model in detecting OoD pixels without affecting the inlier segmentation performance; and (ii) a novel context-robust contrastive learning (CoroCL) that enforces RPL to robustly detect OoD pixels among various contexts. Our approach improves on the previous state of the art by around 10% FPR and 7% AuPRC on the Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly datasets. Our code is available at: https://github.com/yyliu01/RPL.

CV | Jul 26, 2023 | Code
Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling

Hu Wang, Yuanhong Chen, Congbo Ma et al.

The missing modality issue is critical but non-trivial for multi-modal models to solve. Current methods that aim to handle the missing modality problem in multi-modal tasks either deal with missing modalities only during evaluation or train separate models to handle specific missing modality settings. In addition, these models are designed for specific tasks, so, for example, classification models are not easily adapted to segmentation tasks and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method, which is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved through a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. Also, the design simplicity of ShaSpec enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour. The code repository address is https://github.com/billhhh/ShaSpec/.

CV | Mar 28, 2022 | Code
Translation Consistent Semi-supervised Segmentation for 3D Medical Images

Yuyuan Liu, Yu Tian, Chong Wang et al.

3D medical image segmentation methods have been successful, but their dependence on large amounts of voxel-level annotated data is a disadvantage that needs to be addressed given the high cost of obtaining such annotation. Semi-supervised learning (SSL) addresses this issue by training models with a large unlabelled dataset and a small labelled dataset. The most successful SSL approaches are based on consistency learning that minimises the distance between model responses obtained from perturbed views of the unlabelled data. These perturbations usually keep the spatial input context between views fairly consistent, which may cause the model to learn segmentation patterns from the spatial input contexts instead of the segmented objects. In this paper, we introduce Translation Consistent Co-training (TraCoCo), a consistency learning SSL method that perturbs the input data views by varying their spatial input context, allowing the model to learn segmentation patterns from visual objects. Furthermore, we propose the replacement of the commonly used mean squared error (MSE) semi-supervised loss by a new Cross-model confident Binary Cross entropy (CBC) loss, which improves training convergence and keeps the robustness to co-training pseudo-labelling mistakes. We also extend CutMix augmentation to 3D SSL to further improve generalisation. Our TraCoCo shows state-of-the-art results for the Left Atrium (LA) and Brain Tumor Segmentation (BRaTS19) datasets with different backbones. Our code is available at https://github.com/yyliu01/TraCoCo.

CV | Sep 13, 2022 | Code
On the Optimal Combination of Cross-Entropy and Soft Dice Losses for Lesion Segmentation with Out-of-Distribution Robustness

Adrian Galdran, Gustavo Carneiro, Miguel Ángel González Ballester

We study the impact of different loss functions on lesion segmentation from medical images. Although the Cross-Entropy (CE) loss is the most popular option when dealing with natural images, for biomedical image segmentation the soft Dice loss is often preferred due to its ability to handle imbalanced scenarios. On the other hand, the combination of both functions has also been successfully applied in this kind of task. A much less studied problem is the generalization ability of all these losses in the presence of Out-of-Distribution (OoD) data. This refers to samples appearing at test time that are drawn from a different distribution than the training images. In our case, we train our models on images that always contain lesions, but at test time we also have lesion-free samples. We analyze the impact of the minimization of different loss functions on in-distribution performance, and also on the ability to generalize to OoD data, via comprehensive experiments on polyp segmentation from endoscopic images and ulcer segmentation from diabetic feet images. Our findings are surprising: CE-Dice loss combinations that excel in segmenting in-distribution images perform poorly when dealing with OoD data, which leads us to recommend the adoption of the CE loss for this kind of problem, due to its robustness and ability to generalize to OoD samples. Code associated with our experiments can be found at https://github.com/agaldran/lesion_losses_ood.
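
As a rough illustration of the losses compared in the abstract above, the following Python sketch computes a binary cross-entropy, a soft Dice loss, and a weighted combination of the two on flattened per-pixel probabilities. The function names, the `smooth` constant, and the mixing weight `alpha` are assumptions made for this example; they are not taken from the authors' repository.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Clamp probabilities so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss; `smooth` (an assumed constant) avoids division by zero."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


def combined_loss(probs, targets, alpha=0.5):
    """Convex combination: alpha = 1 gives pure CE, alpha = 0 gives pure Dice."""
    ce = binary_cross_entropy(probs, targets)
    dice = soft_dice_loss(probs, targets)
    return alpha * ce + (1.0 - alpha) * dice
```

In this sketch, a lesion-free target (all zeros) makes the intersection term vanish, so the Dice value reduces to `1 - smooth / (sum(probs) + smooth)` and is governed by the smoothing constant rather than by any overlap.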

CV | Mar 2, 2023 | Code
Multi-Head Multi-Loss Model Calibration

Adrian Galdran, Johan Verjans, Gustavo Carneiro et al.

Delivering meaningful uncertainty estimates is essential for a successful deployment of machine learning models in clinical practice. A central aspect of uncertainty quantification is the ability of a model to return predictions that are well-aligned with the actual probability of the model being correct, also known as model calibration. Although many methods have been proposed to improve calibration, no technique can match the simple but expensive approach of training an ensemble of deep neural networks. In this paper we introduce a form of simplified ensembling that bypasses the costly training and inference of deep ensembles, yet keeps their calibration capabilities. The idea is to replace the common linear classifier at the end of a network by a set of heads that are supervised with different loss functions to enforce diversity in their predictions. Specifically, each head is trained to minimize a weighted Cross-Entropy loss, but the weights differ among the branches. We show that the resulting averaged predictions can achieve excellent calibration without sacrificing accuracy in two challenging datasets for histopathological and endoscopic image classification. Our experiments indicate that Multi-Head Multi-Loss classifiers are inherently well-calibrated, outperforming other recent calibration techniques and even challenging Deep Ensembles' performance. Code to reproduce our experiments can be found at https://github.com/agaldran/mhml_calibration.
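
The multi-head idea described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration covering only what the abstract states (per-head weighted cross-entropy with different class weights, and averaging of the heads' predictions); the function names and the way the weights are supplied are assumptions for this example, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def weighted_cross_entropy(logits, label, class_weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    probs = softmax(logits)
    return -class_weights[label] * math.log(probs[label])


def multi_head_loss(head_logits, label, head_class_weights):
    """Sum of per-head losses; each head uses its own (assumed) class weights."""
    return sum(
        weighted_cross_entropy(logits, label, weights)
        for logits, weights in zip(head_logits, head_class_weights)
    )


def ensemble_prediction(head_logits):
    """Average the per-head softmax outputs into a single prediction."""
    per_head = [softmax(logits) for logits in head_logits]
    n_heads = len(per_head)
    n_classes = len(per_head[0])
    return [sum(p[c] for p in per_head) / n_heads for c in range(n_classes)]
```

Because each head sees differently weighted classes, the heads are pushed toward different decision behaviour, and the averaged output plays the role that averaging plays in a deep ensemble.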

CV | Jun 20, 2022 | Code
Test Time Transform Prediction for Open Set Histopathological Image Recognition

Adrian Galdran, Katherine J. Hewitt, Narmin L. Ghaffari et al.

Tissue typology annotation in Whole Slide histological images is a complex and tedious, yet necessary task for the development of computational pathology models. We propose to address this problem by applying Open Set Recognition techniques to the task of jointly classifying tissue that belongs to a set of annotated classes, e.g. clinically relevant tissue categories, while rejecting at test time Open Set samples, i.e. images that belong to categories not present in the training set. To this end, we introduce a new approach for Open Set histopathological image recognition based on training a model to accurately identify image categories and simultaneously predict which data augmentation transform has been applied. At test time, we measure model confidence in predicting this transform, which we expect to be lower for images in the Open Set. We carry out comprehensive experiments in the context of colorectal cancer assessment from histological images, which provide evidence of the strengths of our approach in automatically identifying samples from unknown categories. Code is released at https://github.com/agaldran/t3po.
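
The test-time scoring step described in the abstract above might look roughly as follows. This sketch assumes a model with a separate transform-prediction head whose logits are already available for several augmented views of one image; the function names and the threshold value are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def transform_confidence(transform_logits_per_view, applied_transforms):
    """Mean probability assigned to the transform that was actually applied.

    transform_logits_per_view: one list of transform-head logits per view.
    applied_transforms: index of the transform used to create each view.
    """
    confidences = [
        softmax(logits)[t]
        for logits, t in zip(transform_logits_per_view, applied_transforms)
    ]
    return sum(confidences) / len(confidences)


def is_open_set(confidence, threshold=0.5):
    """Flag a sample as Open Set when transform confidence is low.

    The threshold is an assumed value; in practice it would be tuned
    on held-out data.
    """
    return confidence < threshold
```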

CV | Aug 2, 2023 | Code
Bridging Generative and Discriminative Noisy-Label Learning via Direction-Agnostic EM Formulation

Fengbei Liu, Chong Wang, Yuanhong Chen et al.

Although noisy-label learning is often approached with discriminative methods for simplicity and speed, generative modeling offers a principled alternative by capturing the joint mechanism that produces features, clean labels, and corrupted observations. However, prior work typically (i) introduces extra latent variables and heavy image generators that bias training toward reconstruction, (ii) fixes a single data-generating direction (\(Y \rightarrow X\) or \(X \rightarrow Y\)), limiting adaptability, and (iii) assumes a uniform prior over clean labels, ignoring instance-level uncertainty. We propose a single-stage, EM-style framework for generative noisy-label learning that is direction-agnostic and avoids explicit image synthesis. First, we derive a single Expectation-Maximization (EM) objective whose E-step specializes to either causal orientation without changing the overall optimization. Second, we replace the intractable \(p(X \mid Y)\) with a dataset-normalized discriminative proxy computed using a discriminative classifier on the finite training set, retaining the structural benefits of generative modeling at much lower cost. Third, we introduce Partial-Label Supervision (PLS), an instance-specific prior over clean labels that balances coverage and uncertainty, improving data-dependent regularization. Across standard vision and natural language processing (NLP) noisy-label benchmarks, our method achieves state-of-the-art accuracy, lower transition-matrix estimation error, and substantially less training compute than current generative and discriminative baselines. Code: https://github.com/lfb-1/GNL

IV | Feb 3, 2023
AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge

Coen de Vente, Koenraad A. Vermeer, Nicolas Jaccard et al.

The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.

CV | Dec 27, 2025 | Code
Visual Autoregressive Modelling for Monocular Depth Estimation

Amir El-Ghoussani, André Kaup, Nassir Navab et al.

We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.

CV | Jan 8, 2023
Learning Support and Trivial Prototypes for Interpretable Image Classification

Chong Wang, Yuyuan Liu, Yuanhong Chen et al.

Prototypical part network (ProtoPNet) methods have been designed to achieve interpretable classification by associating predictions with a set of training prototypes, which we refer to as trivial prototypes because they are trained to lie far from the classification boundary in the feature space. Note that it is possible to make an analogy between ProtoPNet and support vector machine (SVM) given that the classification from both methods relies on computing similarity with a set of training points (i.e., trivial prototypes in ProtoPNet, and support vectors in SVM). However, while trivial prototypes are located far from the classification boundary, support vectors are located close to this boundary, and we argue that this discrepancy with the well-established SVM theory can result in ProtoPNet models with inferior classification accuracy. In this paper, we aim to improve the classification of ProtoPNet with a new method to learn support prototypes that lie near the classification boundary in the feature space, as suggested by the SVM theory. In addition, we target the improvement of classification results with a new model, named ST-ProtoPNet, which exploits our support prototypes and the trivial prototypes to provide more effective classification. Experimental results on CUB-200-2011, Stanford Cars, and Stanford Dogs datasets demonstrate that ST-ProtoPNet achieves state-of-the-art classification accuracy and interpretability results. We also show that the proposed support prototypes tend to be better localised in the object of interest rather than in the background region.

CV | Sep 2, 2022
Instance-Dependent Noisy Label Learning via Graphical Modelling

Arpit Garg, Cuong Nguyen, Rafael Felix et al.

Noisy labels are unavoidable yet troublesome in the ecosystem of deep learning because models can easily overfit them. There are many types of label noise, such as symmetric, asymmetric and instance-dependent noise (IDN), with IDN being the only type that depends on image information. Such dependence on image information makes IDN a critical type of label noise to study, given that labelling mistakes are caused in large part by insufficient or ambiguous information about the visual classes present in images. Aiming to provide an effective technique to address IDN, we present a new graphical modelling approach called InstanceGM that combines discriminative and generative models. The main contributions of InstanceGM are: i) the use of the continuous Bernoulli distribution to train the generative model, offering significant training advantages, and ii) the exploration of a state-of-the-art noisy-label discriminative classifier to generate clean labels from instance-dependent noisy-label samples. InstanceGM is competitive with current noisy-label learning approaches, particularly in IDN benchmarks using synthetic and real-world datasets, where our method shows better accuracy than the competitors in most experiments.

IV | Jul 5, 2023
Distilling Missing Modality Knowledge from Ultrasound for Endometriosis Diagnosis with Magnetic Resonance Images

Yuan Zhang, Hu Wang, David Butler et al.

Endometriosis is a common chronic gynecological disorder that has many characteristics, including the pouch of Douglas (POD) obliteration, which can be diagnosed using transvaginal gynecological ultrasound (TVUS) scans and magnetic resonance imaging (MRI). TVUS and MRI are complementary non-invasive imaging techniques for endometriosis diagnosis, but patients are usually not scanned using both modalities, and it is generally more challenging to detect POD obliteration from MRI than from TVUS. To mitigate this classification imbalance, we propose in this paper a knowledge distillation training algorithm to improve POD obliteration detection from MRI by leveraging the detection results from unpaired TVUS data. More specifically, our algorithm pre-trains a teacher model to detect POD obliteration from TVUS data, and it also pre-trains a student model with a 3D masked auto-encoder using a large amount of unlabelled pelvic 3D MRI volumes. Next, we distill the knowledge from the teacher TVUS POD obliteration detector to train the student MRI model by minimizing a regression loss that approximates the output of the student to the teacher using unpaired TVUS and MRI data. Experimental results on our endometriosis dataset containing TVUS and MRI data demonstrate the effectiveness of our method in improving the POD obliteration detection accuracy from MRI.

CV | Mar 23, 2022
Contrastive Transformer-based Multiple Instance Learning for Weakly Supervised Polyp Frame Detection

Yu Tian, Guansong Pang, Fengbei Liu et al.

Current polyp detection methods from colonoscopy videos use exclusively normal (i.e., healthy) training images, which i) ignore the importance of temporal information in consecutive video frames, and ii) lack knowledge about the polyps. Consequently, they often have high detection errors, especially on challenging polyp cases (e.g., small, flat, or partially visible polyps). In this work, we formulate polyp detection as a weakly-supervised anomaly detection task that uses video-level labelled training data to detect frame-level polyps. In particular, we propose a novel convolutional transformer-based multiple instance learning method designed to identify abnormal frames (i.e., frames with polyps) from anomalous videos (i.e., videos containing at least one frame with a polyp). In our method, local and global temporal dependencies are seamlessly captured while we simultaneously optimise video and snippet-level anomaly scores. A contrastive snippet mining method is also proposed to enable effective modelling of the challenging polyp cases. The resulting method achieves a detection accuracy that is substantially better than current state-of-the-art approaches on a new large-scale colonoscopy video dataset introduced in this work.

IV | Mar 22, 2022
Unsupervised Anomaly Detection in Medical Images with a Memory-augmented Multi-level Cross-attentional Masked Autoencoder

Yu Tian, Guansong Pang, Yuyuan Liu et al.

Unsupervised anomaly detection (UAD) aims to find anomalous images by optimising a detector using a training set that contains only normal images. UAD approaches can be based on reconstruction methods, self-supervised approaches, and ImageNet pre-trained models. Reconstruction methods, which detect anomalies from image reconstruction errors, are advantageous because they rely neither on the design of problem-specific pretext tasks needed by self-supervised approaches, nor on the unreliable translation of models pre-trained from non-medical datasets. However, reconstruction methods may fail because they can have low reconstruction errors even for anomalous images. In this paper, we introduce a new reconstruction-based UAD approach that addresses this low-reconstruction error issue for anomalous images. Our UAD approach, the memory-augmented multi-level cross-attentional masked autoencoder (MemMC-MAE), is a transformer-based approach, consisting of a novel memory-augmented self-attention operator for the encoder and a new multi-level cross-attention operator for the decoder. MemMC-MAE masks large parts of the input image during its reconstruction, reducing the risk that it will produce low reconstruction errors because anomalies are likely to be masked and cannot be reconstructed. However, when the anomaly is not masked, the normal patterns stored in the encoder's memory combined with the decoder's multi-level cross attention will constrain the accurate reconstruction of the anomaly. We show that our method achieves SOTA anomaly detection and localisation on colonoscopy, pneumonia, and COVID-19 chest X-ray datasets.

CV | Apr 6, 2023
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

Yuanhong Chen, Yuyuan Liu, Hu Wang et al.

Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files, and 2) a model that can establish strong links between audio information and its corresponding visual object. However, these requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore, experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.

CV | Sep 21, 2022
Multi-view Local Co-occurrence and Global Consistency Learning Improve Mammogram Classification Generalisation

Yuanhong Chen, Hu Wang, Chong Wang et al.

When analysing screening mammograms, radiologists can naturally process information across two ipsilateral views of each breast, namely the cranio-caudal (CC) and mediolateral-oblique (MLO) views. These multiple related images provide complementary diagnostic information and can improve the radiologist's classification accuracy. Unfortunately, most existing deep learning systems, trained with globally-labelled images, lack the ability to jointly analyse and integrate global and local information from these multiple views. By ignoring the potentially valuable information present in multiple images of a screening episode, one limits the potential accuracy of these systems. Here, we propose a new multi-view global-local analysis method that mimics the radiologist's reading procedure, based on a global consistency learning and local co-occurrence learning of ipsilateral views in mammograms. Extensive experiments show that our model outperforms competing methods, in terms of classification accuracy and generalisation, on a large-scale private dataset and two publicly available datasets, where models are exclusively trained and tested with global labels.

CV | Sep 26, 2022
Knowledge Distillation to Ensemble Global and Interpretable Prototype-Based Mammogram Classification Models

Chong Wang, Yuanhong Chen, Yuyuan Liu et al.

State-of-the-art (SOTA) deep learning mammogram classifiers, trained with weakly-labelled images, often rely on global models that produce predictions with limited interpretability, which is a key barrier to their successful translation into clinical practice. On the other hand, prototype-based models improve interpretability by associating predictions with training image prototypes, but they are less accurate than global models and their prototypes tend to have poor diversity. We address these two issues with the proposal of BRAIxProtoPNet++, which adds interpretability to a global model by ensembling it with a prototype-based model. BRAIxProtoPNet++ distills the knowledge of the global model when training the prototype-based model with the goal of increasing the classification accuracy of the ensemble. Moreover, we propose an approach to increase prototype diversity by guaranteeing that all prototypes are associated with different training images. Experiments on weakly-labelled private and public datasets show that BRAIxProtoPNet++ has higher classification accuracy than SOTA global and prototype-based models. Using lesion localisation to assess model interpretability, we show BRAIxProtoPNet++ is more effective than other prototype-based models and post-hoc explanation of global models. Finally, we show that the diversity of the prototypes learned by BRAIxProtoPNet++ is superior to SOTA prototype-based approaches.

CV | Jul 22, 2022
Uncertainty-aware Multi-modal Learning via Cross-modal Random Network Prediction

Hu Wang, Jianpeng Zhang, Yuanhong Chen et al.

Multi-modal learning focuses on training models by equally combining multiple input data modalities during the prediction process. However, this equal combination can be detrimental to the prediction accuracy because different modalities are usually accompanied by varying levels of uncertainty. Using such uncertainty to combine modalities has been studied by a couple of approaches, but with limited success because these approaches are either designed to deal with specific classification or segmentation problems and cannot be easily translated into other tasks, or suffer from numerical instabilities. In this paper, we propose a new Uncertainty-aware Multi-modal Learner that estimates uncertainty by measuring feature density via Cross-modal Random Network Prediction (CRNP). CRNP is designed to require little adaptation to translate between different prediction tasks, while having a stable training process. From a technical point of view, CRNP is the first approach to explore random network prediction to estimate uncertainty and to combine multi-modal data. Experiments on two 3D multi-modal medical image segmentation tasks and three 2D multi-modal computer vision classification tasks show the effectiveness, adaptability and robustness of CRNP. Also, we provide an extensive discussion on different fusion functions and visualization to validate the proposed model.

IV | Mar 3, 2022
BoMD: Bag of Multi-label Descriptors for Noisy Chest X-ray Classification

Yuanhong Chen, Fengbei Liu, Hu Wang et al.

Deep learning methods have shown outstanding classification accuracy in medical imaging problems, which is largely attributed to the availability of large-scale datasets manually annotated with clean labels. However, given the high cost of such manual annotation, new medical imaging classification problems may need to rely on machine-generated noisy labels extracted from radiology reports. Indeed, many Chest X-ray (CXR) classifiers have already been modelled from datasets with noisy labels, but their training procedure is in general not robust to noisy-label samples, leading to sub-optimal models. Furthermore, CXR datasets are mostly multi-label, so current noisy-label learning methods designed for multi-class problems cannot be easily adapted. In this paper, we propose a new method designed for noisy multi-label CXR learning, which detects and smoothly re-labels samples from the dataset; the re-labelled dataset is then used to train common multi-label classifiers. The proposed method optimises a bag of multi-label descriptors (BoMD) to promote their similarity with the semantic descriptors produced by BERT models from the multi-label image annotation. Our experiments on diverse noisy multi-label training sets and clean testing sets show that our model has state-of-the-art accuracy and robustness in many CXR multi-label classification benchmarks.

CV | Jul 9, 2024 | Code
ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation

Yuyuan Liu, Yuanhong Chen, Hu Wang et al.

The costly and time-consuming annotation process to produce large training sets for modelling semantic LiDAR segmentation methods has motivated the development of semi-supervised learning (SSL) methods. However, such SSL approaches often concentrate on employing consistency learning only for individual LiDAR representations. This narrow focus results in limited perturbations that generally fail to enable effective consistency learning. Additionally, these SSL approaches employ contrastive learning based on the sampling from a limited set of positive and negative embedding samples. This paper introduces a novel semi-supervised LiDAR semantic segmentation framework called ItTakesTwo (IT2). IT2 is designed to ensure consistent predictions from peer LiDAR representations, thereby improving the perturbation effectiveness in consistency learning. Furthermore, our contrastive learning employs informative samples drawn from a distribution of positive and negative embeddings learned from the entire training set. Results on public benchmarks show that our approach achieves remarkable improvements over the previous state-of-the-art (SOTA) methods in the field. The code is available at: https://github.com/yyliu01/IT2.

CV | Oct 17, 2022
Bootstrapping the Relationship Between Images and Their Clean and Noisy Labels

Brandon Smart, Gustavo Carneiro

Many state-of-the-art noisy-label learning methods rely on learning mechanisms that estimate the samples' clean labels during training and discard their original noisy labels. However, this approach prevents the learning of the relationship between images, noisy labels and clean labels, which has been shown to be useful when dealing with instance-dependent label noise problems. Furthermore, methods that do aim to learn this relationship require cleanly annotated subsets of data, as well as distillation or multi-faceted models for training. In this paper, we propose a new training algorithm that relies on a simple model to learn the relationship between clean and noisy labels without the need for a cleanly labelled subset of data. Our algorithm follows a 3-stage process, namely: 1) self-supervised pre-training followed by an early-stopping training of the classifier to confidently predict clean labels for a subset of the training set; 2) use the clean set from stage (1) to bootstrap the relationship between images, noisy labels and clean labels, which we exploit for effective relabelling of the remaining training set using semi-supervised learning; and 3) supervised training of the classifier with all relabelled samples from stage (2). By learning this relationship, we achieve state-of-the-art performance in asymmetric and instance-dependent label noise problems.

CVAug 23, 2022
A Study on the Impact of Data Augmentation for Training Convolutional Neural Networks in the Presence of Noisy Labels

Emeson Santana, Gustavo Carneiro, Filipe R. Cordeiro

Label noise is common in large real-world datasets, and its presence harms the training process of deep neural networks. Although several works have focused on the training strategies to address this problem, there are few studies that evaluate the impact of data augmentation as a design choice for training deep neural networks. In this work, we analyse the model robustness when using different data augmentations and how much they improve training in the presence of noisy labels. We evaluate state-of-the-art and classical data augmentation strategies with different levels of synthetic noise for the datasets MNIST, CIFAR-10, CIFAR-100, and the real-world dataset Clothing1M. We evaluate the methods using the accuracy metric. Results show that the appropriate selection of data augmentation can drastically improve the model robustness to label noise, with a relative increase of up to 177.84% in the best test accuracy compared to the baseline with no augmentation, and an absolute increase of up to 6% with the state-of-the-art DivideMix training strategy.

CVMay 26, 2022
Censor-aware Semi-supervised Learning for Survival Time Prediction from Medical Images

Renato Hermoza, Gabriel Maicas, Jacinto C. Nascimento et al.

Survival time prediction from medical images is important for treatment planning, where accurate estimations can improve healthcare quality. One issue affecting the training of survival models is censored data. Most of the current survival prediction approaches are based on Cox models that can deal with censored data, but their application scope is limited because they output a hazard function instead of a survival time. On the other hand, methods that predict survival time usually ignore censored data, resulting in an under-utilization of the training set. In this work, we propose a new training method that predicts survival time using all censored and uncensored data. We propose to treat censored data as samples with a lower-bound time to death and estimate pseudo labels to semi-supervise a censor-aware survival time regressor. We evaluate our method on pathology and x-ray images from the TCGA-GM and NLST datasets. Our results establish the state-of-the-art survival prediction accuracy on both datasets.
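
The idea of treating a censored sample as a lower bound on the time to death can be illustrated with a small per-sample loss. This is a minimal sketch under assumptions, not the objective used in the paper: the function names and the choice of an absolute-error penalty are hypothetical, and the pseudo-label estimation step mentioned in the abstract is not reproduced.

```python
def censor_aware_l1(pred_time, observed_time, is_censored):
    """Hypothetical per-sample loss; not the paper's exact objective."""
    if is_censored:
        # Censored: the true time is at least observed_time, so only
        # predictions that fall below this lower bound are penalised.
        return max(0.0, observed_time - pred_time)
    # Uncensored: the event time is known, so use the absolute error.
    return abs(pred_time - observed_time)


def mean_loss(preds, times, censored):
    """Average the per-sample loss over a batch given as plain lists."""
    losses = [censor_aware_l1(p, t, c) for p, t, c in zip(preds, times, censored)]
    return sum(losses) / len(losses)
```

With this form, a censored sample contributes no loss once the prediction exceeds its last observed time, so it still constrains the regressor instead of being discarded.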

CVNov 18, 2022
Knowing What to Label for Few Shot Microscopy Image Cell Segmentation

Youssef Dawoud, Arij Bouazizi, Katharina Ernst et al.

In microscopy image cell segmentation, it is common to train a deep neural network on source data, containing different types of microscopy images, and then fine-tune it using a support set comprising a few randomly selected and annotated training target images. In this paper, we argue that the random selection of unlabelled training target images to be annotated and included in the support set may not enable an effective fine-tuning process, so we propose a new approach to optimise this image selection process. Our approach involves a new scoring function to find informative unlabelled target images. In particular, we propose to measure the consistency in the model predictions on target images against specific data augmentations. However, we observe that the model trained with source datasets does not reliably evaluate consistency on target images. To alleviate this problem, we propose novel self-supervised pretext tasks to compute the scores of unlabelled target images. Finally, the top few images with the least consistency scores are added to the support set for oracle (i.e., expert) annotation and later used to fine-tune the model to the target images. In our evaluations that involve the segmentation of five different types of cell images, we demonstrate promising results on several target test sets compared to the random selection approach as well as other selection approaches, such as Shannon's entropy and Monte-Carlo dropout.
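
The selection step described above (rank unlabelled target images by consistency and send the least consistent ones to the oracle) could be sketched as follows. This is an illustrative assumption only: the abstract states that the scores come from self-supervised pretext tasks, whereas this sketch uses a plain pixel-agreement score, and both function names are hypothetical.

```python
def consistency_score(pred, augmented_preds):
    """Fraction of pixels where predictions under augmentation match the original.

    `pred` is a flat list of per-pixel labels; `augmented_preds` is a list of
    such lists, already mapped back to the original image geometry.
    """
    agree = 0
    total = 0
    for aug in augmented_preds:
        for p, q in zip(pred, aug):
            agree += int(p == q)
            total += 1
    return agree / total


def pick_for_annotation(scores, budget):
    """Return the indices of the `budget` images with the lowest consistency."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:budget]
```

For example, `pick_for_annotation([0.9, 0.4, 0.875, 0.6], 2)` selects images 1 and 3, the two with the lowest scores.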

CVMar 20, 2023
PASS: Peer-Agreement based Sample Selection for training with Noisy Labels

Arpit Garg, Cuong Nguyen, Rafael Felix et al.

The prevalence of noisy-label samples poses a significant challenge in deep learning, inducing overfitting effects. This has, therefore, motivated the emergence of learning with noisy-label (LNL) techniques that focus on separating noisy- and clean-label samples to apply different learning strategies to each group of samples. Current methodologies often rely on the small-loss hypothesis or feature-based selection to separate noisy- and clean-label samples, yet our empirical observations reveal their limitations, especially for labels with instance-dependent noise (IDN). An important characteristic of IDN is the difficulty of distinguishing the clean-label samples that lie near the decision boundary (i.e., the hard samples) from the noisy-label samples. We, therefore, propose a new noisy-label detection method, termed Peer-Agreement based Sample Selection (PASS), to address this problem. Utilising a trio of classifiers, PASS employs consensus-driven peer-based agreement of two models to select the samples to train the remaining model. PASS is easily integrated into existing LNL models, enabling the improvement of the detection accuracy of noisy- and clean-label samples, which increases the classification accuracy across various LNL benchmarks.
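
A rough sketch of peer-based selection with three classifiers is given below. The agreement rule is an assumption made for illustration (both peers predict the same class and that class matches the given label); the abstract does not specify the actual criterion, and the function name is hypothetical.

```python
def select_by_peer_agreement(peer_a_preds, peer_b_preds, noisy_labels):
    """Split sample indices into (clean, noisy) for training the third model.

    Assumed rule, not taken from the paper: a sample is treated as clean when
    the two peer models agree with each other and with the provided label.
    """
    clean, noisy = [], []
    for i, (a, b, y) in enumerate(zip(peer_a_preds, peer_b_preds, noisy_labels)):
        if a == b == y:
            clean.append(i)
        else:
            noisy.append(i)
    return clean, noisy
```

The two index lists can then be routed to different learning strategies, as LNL pipelines typically do for clean and noisy subsets.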

CVAug 9, 2023
SelectNAdapt: Support Set Selection for Few-Shot Domain Adaptation

Youssef Dawoud, Gustavo Carneiro, Vasileios Belagiannis

Generalisation of deep neural networks becomes vulnerable when distribution shifts are encountered between train (source) and test (target) domain data. Few-shot domain adaptation mitigates this issue by adapting deep neural networks pre-trained on the source domain to the target domain using a randomly selected and annotated support set from the target domain. This paper argues that randomly selecting the support set can be further improved for effectively adapting the pre-trained source models to the target domain. Alternatively, we propose SelectNAdapt, an algorithm to curate the selection of the target domain samples, which are then annotated and included in the support set. In particular, for the K-shot adaptation problem, we first leverage self-supervision to learn features of the target domain data. Then, we propose a per-class clustering scheme of the learned target domain features and select K representative target samples using a distance-based scoring function. Finally, we bring our selection setup towards a practical ground by relying on pseudo-labels for clustering semantically similar target domain samples. Our experiments show promising results on three few-shot domain adaptation benchmarks for image recognition compared to related approaches and the standard random selection.
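
The distance-based selection can be sketched as follows, assuming feature vectors and pseudo-labels are already available. This is a simplification rather than the SelectNAdapt algorithm itself: a single centroid per pseudo-class stands in for the clustering step, Euclidean distance is an assumed scoring function, and all names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def select_support_set(features, pseudo_labels, k):
    """For each pseudo-class, pick the k samples closest to the class centroid."""
    by_class = {}
    for idx, label in enumerate(pseudo_labels):
        by_class.setdefault(label, []).append(idx)

    selected = {}
    for label, indices in by_class.items():
        c = centroid([features[i] for i in indices])
        # Smaller distance to the centroid is read as "more representative".
        ranked = sorted(indices, key=lambda i: math.dist(features[i], c))
        selected[label] = ranked[:k]
    return selected
```

The returned indices are the samples that would be sent for annotation and placed in the support set.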

93.8CVApr 16Code
Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

Amir El-Ghoussani, Marc Hölle, Gustavo Carneiro et al.

We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at 'https://github.com/AmirMaEl/MLN'.

LGJan 4, 2023
Task Weighting in Meta-learning with Trajectory Optimisation

Cuong Nguyen, Thanh-Toan Do, Gustavo Carneiro

Developing meta-learning algorithms that are un-biased toward a subset of training tasks often requires hand-designed criteria to weight tasks, potentially resulting in sub-optimal solutions. In this paper, we introduce a new principled and fully-automated task-weighting algorithm for meta-learning methods. By considering the weights of tasks within the same mini-batch as an action, and the meta-parameter of interest as the system state, we cast the task-weighting meta-learning problem as a trajectory optimisation problem and employ the iterative linear quadratic regulator to determine the optimal action, i.e., the weights of tasks. We theoretically show that the proposed algorithm converges to an $\epsilon_{0}$-stationary point, and empirically demonstrate that the proposed approach out-performs common hand-engineering weighting methods in two few-shot learning benchmarks.

CVAug 3, 2022
Edge-Based Self-Supervision for Semi-Supervised Few-Shot Microscopy Image Cell Segmentation

Youssef Dawoud, Katharina Ernst, Gustavo Carneiro et al.

Deep neural networks currently deliver promising results for microscopy image cell segmentation, but they require large-scale labelled databases, whose creation is a costly and time-consuming process. In this work, we relax the labelling requirement by combining self-supervised with semi-supervised learning. We propose the prediction of edge-based maps for self-supervising the training of the unlabelled images, which is combined with the supervised training of a small number of labelled images for learning the segmentation task. In our experiments, we evaluate on a few-shot microscopy image cell segmentation benchmark and show that only a small number of annotated images, e.g. 10% of the original training set, is enough for our approach to reach similar performance as with the fully annotated databases on 1- to 10-shots. Our code and trained models are made publicly available.

45.3CVMay 28
Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging

Milad Masroor, Cuong Nguyen, Kevin Wells et al.

Medical image analysis models can exhibit performance disparities across patient subgroups, threatening clinical safety and fairness. Existing methods typically address this issue by optimizing accuracy and fairness metrics for visible demographic attributes (e.g., sex or age) considered in isolation. This strategy not only overlooks potentially more informative latent stratifications, which may reveal deeper sources of model error and inequity, but also fails to scale when multiple demographic attributes are considered simultaneously due to the resulting sparsity of training data within each subgroup. We deal with these issues by introducing the label-free hidden-cohort fairness (LHCF) training paradigm, which, instead of maximizing fairness over visible demographic attributes, optimizes fairness across latent subpopulations discovered from image appearance. By clustering images into K appearance-based cohorts and applying fairness optimization over them, LHCF uncovers underlying sources of model error and avoids the combinatorial sparsity of multi-demographic attributes, reducing disparities across both single and multiple demographic attributes. We demonstrate on our proposed fairness benchmark, HIDFairBench, that LHCF provides state-of-the-art fairness results on single and multiple demographic attributes, despite never using demographic labels for training. Our results position hidden-cohort fairness as a practical, scalable, and robust alternative to demographic-based fairness optimization for trustworthy medical image analysis.

LGAug 17, 2022
Maximising the Utility of Validation Sets for Imbalanced Noisy-label Meta-learning

Dung Anh Hoang, Cuong Nguyen, Vasileios Belagiannis et al.

Meta-learning is an effective method to handle imbalanced and noisy-label learning, but it depends on a validation set containing randomly selected, manually labelled and balanced distributed samples. The random selection and manual labelling and balancing of this validation set are not only sub-optimal for meta-learning, but they also scale poorly with the number of classes. Hence, recent meta-learning papers have proposed ad-hoc heuristics to automatically build and label this validation set, but these heuristics are still sub-optimal for meta-learning. In this paper, we analyse the meta-learning algorithm and propose new criteria to characterise the utility of the validation set, based on: 1) the informativeness of the validation set; 2) the class distribution balance of the set; and 3) the correctness of the labels of the set. Furthermore, we propose a new imbalanced noisy-label meta-learning (INOLML) algorithm that automatically builds a validation set by maximising its utility using the criteria above. Our method shows significant improvements over previous meta-learning approaches and sets the new state-of-the-art on several benchmarks.

LGApr 28, 2022
Mixup-based Deep Metric Learning Approaches for Incomplete Supervision

Luiz H. Buris, Daniel C. G. Pedronette, Joao P. Papa et al.

Deep learning architectures have achieved promising results in different areas (e.g., medicine, agriculture, and security). However, using those powerful techniques in many real applications becomes challenging due to the large labeled collections required during training. Several works have pursued solutions to overcome it by proposing strategies that can learn more for less, e.g., weakly and semi-supervised learning approaches. As these approaches do not usually address memorization and sensitivity to adversarial examples, this paper presents three deep metric learning approaches combined with Mixup for incomplete-supervision scenarios. We show that some state-of-the-art approaches in metric learning might not work well in such scenarios. Moreover, the proposed approaches outperform most of them in different datasets.
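
For readers unfamiliar with Mixup, the basic operation is a convex combination of two samples and their labels with a Beta-distributed weight. The sketch below shows only that standard operation on plain lists; how the paper combines it with deep metric learning losses is not described in the abstract and is not reproduced here. The function name and the default `alpha` are illustrative assumptions.

```python
import random


def mixup(x_i, x_j, y_i, y_j, alpha=1.0, rng=random):
    """Blend two samples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam
```

Passing `rng=random.Random(seed)` makes the draw reproducible; the mixed label remains a valid distribution because the two weights sum to one.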

CVJan 1, 2023
Asymmetric Co-teaching with Multi-view Consensus for Noisy Label Learning

Fengbei Liu, Yuanhong Chen, Chong Wang et al.

Learning with noisy labels has become an important research topic in computer vision where state-of-the-art (SOTA) methods explore: 1) prediction disagreement with a co-teaching strategy that updates two models when they disagree on the prediction of training samples; and 2) sample selection to divide the training set into clean and noisy sets based on small training loss. However, the quick convergence of co-teaching models to select the same clean subsets combined with relatively fast overfitting of noisy labels may induce the wrong selection of noisy label samples as clean, leading to an inevitable confirmation bias that damages accuracy. In this paper, we introduce our noisy-label learning approach, called Asymmetric Co-teaching (AsyCo), which introduces a novel prediction disagreement that produces more consistent divergent results of the co-teaching models, and a new sample selection approach that does not require the small-loss assumption to enable better robustness to confirmation bias than previous methods. More specifically, the new prediction disagreement is achieved with the use of different training strategies, where one model is trained with multi-class learning and the other with multi-label learning. Also, the new sample selection is based on multi-view consensus, which uses the label views from training labels and model predictions to divide the training set into clean and noisy for training the multi-class model and to re-label the training samples with multiple top-ranked labels for training the multi-label model. Extensive experiments on synthetic and real-world noisy-label datasets show that AsyCo improves over current SOTA methods.

CVAug 23, 2022
An Evolutionary Approach for Creating of Diverse Classifier Ensembles

Alvaro R. Ferreira, Fabio A. Faria, Gustavo Carneiro et al.

Classification is one of the most studied tasks in data mining and machine learning areas and many works in the literature have been presented to solve classification problems for multiple fields of knowledge such as medicine, biology, security, and remote sensing. Since there is no single classifier that achieves the best results for all kinds of applications, a good alternative is to adopt classifier fusion strategies. A key point in the success of classifier fusion approaches is the combination of diversity and accuracy among classifiers belonging to an ensemble. With a large number of classification models available in the literature, one challenge is the choice of the most suitable classifiers to compose the final classification system, which generates the need for classifier selection strategies. We address this point by proposing a framework for classifier selection and fusion based on a four-step protocol called CIF-E (Classifiers, Initialization, Fitness function, and Evolutionary algorithm). We implement and evaluate 24 varied ensemble approaches following the proposed CIF-E protocol and we are able to find the most accurate approach. A comparative analysis has also been performed among the best approaches and many other baselines from the literature. The experiments show that the proposed evolutionary approach based on Univariate Marginal Distribution Algorithm (UMDA) can outperform the state-of-the-art literature approaches in many well-known UCI datasets.

CVOct 2, 2023
Learnable Cross-modal Knowledge Distillation for Multi-modal Learning with Missing Modality

Hu Wang, Congbo Ma, Jianpeng Zhang et al.

The problem of missing modalities is both critical and non-trivial to be handled in multi-modal models. It is common for multi-modal tasks that certain modalities contribute more compared to other modalities, and if those important modalities are missing, the model performance drops significantly. This fact remains unexplored by current multi-modal approaches that recover the representation from missing modalities by feature reconstruction or blind feature aggregation from other modalities, instead of extracting useful information from the best performing modalities. In this paper, we propose a Learnable Cross-modal Knowledge Distillation (LCKD) model to adaptively identify important modalities and distil knowledge from them to help other modalities from the cross-modal perspective for solving the missing modality issue. Our approach introduces a teacher election procedure to select the most ``qualified'' teachers based on their single modality performance on certain tasks. Then, cross-modal knowledge distillation is performed between teacher and student modalities for each task to push the model parameters to a point that is beneficial for all tasks. Hence, even if the teacher modalities for certain tasks are missing during testing, the available student modalities can accomplish the task well enough based on the learned knowledge from their automatically elected teacher modalities. Experiments on the Brain Tumour Segmentation Dataset 2018 (BraTS2018) show that LCKD outperforms other methods by a considerable margin, improving the state-of-the-art performance by 3.61% for enhancing tumour, 5.99% for tumour core, and 3.76% for whole tumour in terms of segmentation Dice score.

LGJul 3, 2024
Model and Feature Diversity for Bayesian Neural Networks in Mutual Learning

Cuong Pham, Cuong C. Nguyen, Trung Le et al.

Bayesian Neural Networks (BNNs) offer probability distributions for model parameters, enabling uncertainty quantification in predictions. However, they often underperform compared to deterministic neural networks. Utilizing mutual learning can effectively enhance the performance of peer BNNs. In this paper, we propose a novel approach to improve BNNs performance through deep mutual learning. The proposed approaches aim to increase diversity in both network parameter distributions and feature distributions, promoting peer networks to acquire distinct features that capture different characteristics of the input, which enhances the effectiveness of mutual learning. Experimental results demonstrate significant improvements in the classification accuracy, negative log-likelihood, and expected calibration error when compared to traditional mutual learning for BNNs.

CVJul 7, 2024
CPM: Class-conditional Prompting Machine for Audio-visual Segmentation

Yuanhong Chen, Chong Wang, Yuyuan Liu et al.

Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging transformer-based segmentation architecture due to its inherent ability to capture long-range dependencies and flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is upgraded with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.

CVJul 10, 2024Code
Bayesian Detector Combination for Object Detection with Crowdsourced Annotations

Zhi Qin Tan, Olga Isupova, Gustavo Carneiro et al.

Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise, especially in crowdsourcing scenarios. Most prior object detection methods assume accurate annotations; a few recent works have studied object detection with noisy crowdsourced annotations, with evaluation on distinct synthetic crowdsourced datasets of varying setups under artificial assumptions. To address these algorithmic limitations and evaluation inconsistency, we first propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations, with the unique ability of automatically inferring the annotators' label qualities. Unlike previous approaches, BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models. Due to the scarcity of real-world crowdsourced datasets, we introduce large synthetic datasets by simulating varying crowdsourcing scenarios. This allows consistent evaluation of different models at scale. Extensive experiments on both real and synthetic crowdsourced datasets show that BDC outperforms existing state-of-the-art methods, demonstrating its superiority in leveraging crowdsourced data for object detection. Our code and data are available at https://github.com/zhiqin1998/bdc.

CVNov 30, 2023
Mixture of Gaussian-distributed Prototypes with Generative Modelling for Interpretable and Trustworthy Image Recognition

Chong Wang, Yuanhong Chen, Fengbei Liu et al.

Prototypical-part methods, e.g., ProtoPNet, enhance interpretability in image recognition by linking predictions to training prototypes, thereby offering intuitive insights into their decision-making. Existing methods, which rely on a point-based learning of prototypes, typically face two critical issues: 1) the learned prototypes have limited representation power and are not suitable to detect Out-of-Distribution (OoD) inputs, reducing their decision trustworthiness; and 2) the necessary projection of the learned prototypes back into the space of training images causes a drastic degradation in the predictive performance. Furthermore, current prototype learning adopts an aggressive approach that considers only the most active object parts during training, while overlooking sub-salient object regions which still hold crucial classification information. In this paper, we present a new generative paradigm to learn prototype distributions, termed as Mixture of Gaussian-distributed Prototypes (MGProto). The distribution of prototypes from MGProto enables both interpretable image classification and trustworthy recognition of OoD inputs. The optimisation of MGProto naturally projects the learned prototype distributions back into the training image space, thereby addressing the performance degradation caused by prototype projection. Additionally, we develop a novel and effective prototype mining strategy that considers not only the most active but also sub-salient object parts. To promote model compactness, we further propose to prune MGProto by removing prototypes with low importance priors. Experiments on CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets datasets show that MGProto achieves state-of-the-art image recognition and OoD detection performances, while providing encouraging interpretability results.

CVJul 20, 2024
MetaAug: Meta-Data Augmentation for Post-Training Quantization

Cuong Pham, Hoang Anh Dung, Cuong C. Nguyen et al.

Post-Training Quantization (PTQ) has received significant attention because it requires only a small set of calibration data to quantize a full-precision model, which is more practical in real-world applications in which full access to a large training set is not available. However, it often leads to overfitting on the small calibration dataset. Several methods have been proposed to address this issue, yet they still rely on only the calibration set for the quantization and they do not validate the quantized model due to the lack of a validation set. In this work, we propose a novel meta-learning based approach to enhance the performance of post-training quantization. Specifically, to mitigate the overfitting problem, instead of only training the quantized model using the original calibration set without any validation during the learning process as in previous PTQ works, in our approach, we both train and validate the quantized model using two different sets of images. In particular, we propose a meta-learning based approach to jointly optimize a transformation network and a quantized model through bi-level optimization. The transformation network modifies the original calibration data and the modified data will be used as the training set to learn the quantized model with the objective that the quantized model achieves a good performance on the original calibration data. Extensive experiments on the widely used ImageNet dataset with different neural network architectures demonstrate that our approach outperforms the state-of-the-art PTQ methods.
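
The bi-level optimisation described above can be written compactly as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: $\theta$ denotes the quantized model parameters, $\phi$ the transformation network parameters, $T_{\phi}$ the transformation network, $\mathcal{D}_{\mathrm{cal}}$ the original calibration set, and $\mathcal{L}$ a generic loss (the inner and outer losses may differ in the actual method).

```latex
\min_{\phi} \; \mathcal{L}\big(\theta^{*}(\phi);\, \mathcal{D}_{\mathrm{cal}}\big)
\quad \text{s.t.} \quad
\theta^{*}(\phi) = \arg\min_{\theta} \; \mathcal{L}\big(\theta;\, T_{\phi}(\mathcal{D}_{\mathrm{cal}})\big)
```

The inner problem trains the quantized model on the transformed calibration data, while the outer problem uses the original calibration data as the validation signal for the transformation network.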

CVSep 3, 2024
Human-AI Collaborative Multi-modal Multi-rater Learning for Endometriosis Diagnosis

Hu Wang, David Butler, Yuan Zhang et al.

Endometriosis, affecting about 10% of individuals assigned female at birth, is challenging to diagnose and manage. Diagnosis typically involves the identification of various signs of the disease using either laparoscopic surgery or the analysis of T1/T2 MRI images, with the latter being quicker and cheaper but less accurate. A key diagnostic sign of endometriosis is the obliteration of the Pouch of Douglas (POD). However, even experienced clinicians struggle with accurately classifying POD obliteration from MRI images, which complicates the training of reliable AI models. In this paper, we introduce the Human-AI Collaborative Multi-modal Multi-rater Learning (HAICOMM) methodology to address the challenge above. HAICOMM is the first method that explores three important aspects of this problem: 1) multi-rater learning to extract a cleaner label from the multiple "noisy" labels available per training sample; 2) multi-modal learning to leverage the presence of T1/T2 MRI images for training and testing; and 3) human-AI collaboration to build a system that leverages the predictions from clinicians and the AI model to provide more accurate classification than standalone clinicians and AI models. Presenting results on the multi-rater T1/T2 MRI endometriosis dataset that we collected to validate our methodology, the proposed HAICOMM model outperforms an ensemble of clinicians, noisy-label learning models, and multi-rater learning methods.

CVSep 12, 2024
Deep Multimodal Learning with Missing Modality: A Survey

Renjie Wu, Hu Wang, Hsiang-Ting Chen et al.

During multimodal model training and testing, certain data modalities may be absent due to sensor limitations, cost constraints, privacy concerns, or data loss, negatively affecting performance. Multimodal learning techniques designed to handle missing modalities can mitigate this by ensuring model robustness even when some modalities are unavailable. This survey reviews recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning methods. It provides the first comprehensive survey that covers the motivation and distinctions between MLMM and standard multimodal learning setups, followed by a detailed analysis of current methods, applications, and datasets, concluding with challenges and future directions.

CVJul 9, 2024
Learning to Complement and to Defer to Multiple Users

Zheng Zhang, Wenjie Ai, Kevin Wells et al.

With the development of Human-AI Collaboration in Classification (HAI-CC), integrating users and AI predictions becomes challenging due to the complex decision-making process. This process has three options: 1) AI autonomously classifies, 2) learning to complement, where AI collaborates with users, and 3) learning to defer, where AI defers to users. Despite their interconnected nature, these options have been studied in isolation rather than as components of a unified system. In this paper, we address this weakness with the novel HAI-CC methodology, called Learning to Complement and to Defer to Multiple Users (LECODU). LECODU not only combines learning to complement and learning to defer strategies, but it also incorporates an estimation of the optimal number of users to engage in the decision process. The training of LECODU maximises classification accuracy and minimises collaboration costs associated with user involvement. Comprehensive evaluations across real-world and synthesized datasets demonstrate LECODU's superior performance compared to state-of-the-art HAI-CC methods. Remarkably, even when relying on unreliable users with high rates of label noise, LECODU exhibits significant improvement over both human decision-makers alone and AI alone.

CVSep 19, 2024
A Novel Perspective for Multi-modal Multi-label Skin Lesion Classification

Yuan Zhang, Yutong Xie, Hu Wang et al.

The efficacy of deep learning-based Computer-Aided Diagnosis (CAD) methods for skin diseases relies on analyzing multiple data modalities (i.e., clinical+dermoscopic images, and patient metadata) and addressing the challenges of multi-label classification. Current approaches tend to rely on limited multi-modal techniques and treat the multi-label problem as a multiple multi-class problem, overlooking issues related to imbalanced learning and multi-label correlation. This paper introduces the innovative Skin Lesion Classifier, utilizing a Multi-modal Multi-label TransFormer-based model (SkinM2Former). For multi-modal analysis, we introduce the Tri-Modal Cross-attention Transformer (TMCT) that fuses the three image and metadata modalities at various feature levels of a transformer encoder. For multi-label classification, we introduce a multi-head attention (MHA) module to learn multi-label correlations, complemented by an optimisation that handles multi-label and imbalanced learning problems. SkinM2Former achieves a mean average accuracy of 77.27% and a mean diagnostic accuracy of 77.85% on the public Derm7pt dataset, outperforming state-of-the-art (SOTA) methods.

LGNov 14, 2024Code
Rethinking Weight-Averaged Model-merging

Hu Wang, Congbo Ma, Ibrahim Almakky et al.

Model merging, particularly through weight averaging, has shown surprising effectiveness in saving computations and improving model performance without any additional training. However, the interpretability of why and how this technique works remains unclear. In this work, we reinterpret weight-averaged model merging through the lens of interpretability and provide empirical insights into the underlying mechanisms that govern its behavior. We approach the problem from three perspectives: (1) we analyze the learned weight structures and demonstrate that model weights encode structured representations that help explain the compatibility of weight averaging; (2) we compare averaging in weight space and feature space across diverse model architectures (CNNs and ViTs) and datasets, aiming to expose under which circumstances what combination paradigm will work more effectively; (3) we study the effect of parameter scaling on prediction stability, highlighting how weight averaging acts as a form of regularization that contributes to robustness. By framing these analyses in an interpretability context, our work contributes to a more transparent and systematic understanding of model merging for stakeholders interested in the safety and reliability of untrained model combination methods. The code is available at https://github.com/billhhh/Rethink-Merge.
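A minimal sketch of the distinction between averaging in weight space and averaging in feature (output) space is given below. It is an illustration under simplifying assumptions, not the code from the linked repository: the models are toy single-neuron functions stored as plain dictionaries, and the names `average_weights`, `linear`, and `relu_model` are hypothetical.

```python
def average_weights(state_dicts):
    """Element-wise mean of parameter dicts that share keys and shapes."""
    n = len(state_dicts)
    first = state_dicts[0]
    return {
        key: [sum(sd[key][i] for sd in state_dicts) / n
              for i in range(len(first[key]))]
        for key in first
    }


def linear(params, x):
    """Toy linear model: w . x + b."""
    return sum(w * xi for w, xi in zip(params["w"], x)) + params["b"][0]


def relu_model(params, x):
    """The same model followed by a ReLU non-linearity."""
    return max(0.0, linear(params, x))


model_a = {"w": [1.0, -2.0], "b": [0.5]}
model_b = {"w": [3.0, 0.0], "b": [-0.5]}
x = [1.0, 1.0]

merged = average_weights([model_a, model_b])

# Weight-space averaging: run the merged model once.
weight_avg_linear = linear(merged, x)
weight_avg_relu = relu_model(merged, x)

# Feature-space averaging: run both models and average their outputs.
feature_avg_linear = (linear(model_a, x) + linear(model_b, x)) / 2
feature_avg_relu = (relu_model(model_a, x) + relu_model(model_b, x)) / 2
```

For the purely linear model the two averages coincide, because the output is linear in the parameters. Once the ReLU is added they differ, which is one simple reason the choice between the two paradigms can matter.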

CVNov 22, 2023
Learning to Complement with Multiple Humans

Zheng Zhang, Cuong Nguyen, Kevin Wells et al.

Real-world image classification tasks tend to be complex, where expert labellers are sometimes unsure about the classes present in the images, leading to the issue of learning with noisy labels (LNL). The ill-posedness of the LNL task requires the adoption of strong assumptions or the use of multiple noisy labels per training image, resulting in accurate models that work well in isolation but fail to optimise human-AI collaborative classification (HAI-CC). Unlike such LNL methods, HAI-CC aims to leverage the synergies between human expertise and AI capabilities but requires clean training labels, limiting its real-world applicability. This paper addresses this gap by introducing the innovative Learning to Complement with Multiple Humans (LECOMH) approach. LECOMH is designed to learn from noisy labels without depending on clean labels, simultaneously maximising collaborative accuracy while minimising the cost of human collaboration, measured by the number of human expert annotations required per image. Additionally, new benchmarks featuring multiple noisy labels for both training and testing are proposed to evaluate HAI-CC methods. Through quantitative comparisons on these benchmarks, LECOMH consistently outperforms competitive HAI-CC approaches, human labellers, multi-rater learning, and noisy-label learning methods across various datasets, offering a promising solution for addressing real-world image classification challenges.

CVMay 12, 2024Code
Meta-Learned Modality-Weighted Knowledge Distillation for Robust Multi-Modal Learning with Missing Data

Hu Wang, Salma Hassan, Yuyuan Liu et al.

In multi-modal learning, some modalities are more influential than others, and their absence can have a significant impact on classification/segmentation accuracy. Addressing this challenge, we propose a novel approach called Meta-learned Modality-weighted Knowledge Distillation (MetaKD), which enables multi-modal models to maintain high accuracy even when key modalities are missing. MetaKD adaptively estimates the importance weight of each modality through a meta-learning process. These learned importance weights guide a pairwise modality-weighted knowledge distillation process, allowing high-importance modalities to transfer knowledge to lower-importance ones, resulting in robust performance despite missing inputs. Unlike previous methods in the field, which are often task-specific and require significant modifications, our approach is designed to work in multiple tasks (e.g., segmentation and classification) with minimal adaptation. Experimental results on five prevalent datasets, including three Brain Tumor Segmentation datasets (BraTS2018, BraTS2019 and BraTS2020), the Alzheimer's Disease Neuroimaging Initiative (ADNI) classification dataset and the Audiovision-MNIST classification dataset, demonstrate the proposed model is able to outperform the compared models by a large margin. The code is available at https://github.com/billhhh/MetaKD.

HCOct 22, 2025Code
Learning To Defer To A Population With Limited Demonstrations

Nilesh Ramgolam, Gustavo Carneiro, Hsiang-Ting Chen

This paper addresses the critical data scarcity that hinders the practical deployment of learning to defer (L2D) systems to the population. We introduce a context-aware, semi-supervised framework that uses meta-learning to generate expert-specific embeddings from only a few demonstrations. We demonstrate the efficacy of a dual-purpose mechanism, where these embeddings are used first to generate a large corpus of pseudo-labels for training, and subsequently to enable on-the-fly adaptation to new experts at test-time. Experimental results on three different datasets confirm that a model trained on these synthetic labels rapidly approaches oracle-level performance, validating the data efficiency of our approach. By resolving a key training bottleneck, this work makes adaptive L2D systems more practical and scalable, paving the way for human-AI collaboration in real-world environments. To facilitate reproducibility and address implementation details not covered in the main text, we provide our source code and training configurations at https://github.com/nil123532/learning-to-defer-to-a-population-with-limited-demonstrations.

CVJul 28, 2025Code
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

Ao Li, Yuxiang Duan, Jinghui Zhang et al.

Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
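The idea of scoring tokens by how much their representation changes in magnitude and direction can be sketched as follows. The abstract does not give the TTV formula, so the score used here (relative change in norm plus one minus cosine similarity) is an assumption, as are the function names `transition_score` and `keep_top_k`. Instruction-Guided Attention and the progressive pruning schedule are omitted.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def transition_score(before, after, eps=1e-8):
    """Assumed score: relative norm change + (1 - cosine similarity)."""
    nb, na = _norm(before), _norm(after)
    magnitude_change = abs(na - nb) / (nb + eps)
    cosine = sum(b * a for b, a in zip(before, after)) / (nb * na + eps)
    direction_change = 1.0 - cosine
    return magnitude_change + direction_change


def keep_top_k(before_tokens, after_tokens, k):
    """Return indices (in original order) of the k highest-scoring tokens."""
    scores = [transition_score(b, a)
              for b, a in zip(before_tokens, after_tokens)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Token representations before and after a hypothetical layer.
before_tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
after_tokens = [
    [1.0, 0.0],  # unchanged
    [0.0, 2.0],  # rotated and scaled
    [2.0, 0.0],  # scaled only
]

kept = keep_top_k(before_tokens, after_tokens, k=2)
```

In this toy input the unchanged token receives a score near zero and is pruned, while the two tokens whose representations moved are kept.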

CVJun 1, 2025Code
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu, Yuanhong Chen, Chong Wang et al.

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2's decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over the previous methods in the field. Code is available at https://github.com/yyliu01/AuralSAM2.