Mehrtash Harandi

CV
h-index40
111papers
5,233citations
Novelty52%
AI Score60

111 Papers

CVJul 21, 2023Code
EndoSurf: Neural Surface Reconstruction of Deformable Tissues with Stereo Endoscope Videos

Ruyi Zha, Xuelian Cheng, Hongdong Li et al. · ibm-research

Reconstructing soft tissues from stereo endoscope videos is an essential prerequisite for many medical applications. Previous methods struggle to produce high-quality geometry and appearance due to their inadequate representations of 3D scenes. To address this issue, we propose a novel neural-field-based method, called EndoSurf, which effectively learns to represent a deforming surface from an RGBD sequence. In EndoSurf, we model surface dynamics, shape, and texture with three neural fields. First, 3D points are transformed from the observed space to the canonical space using the deformation field. The signed distance function (SDF) field and radiance field then predict their SDFs and colors, respectively, with which RGBD images can be synthesized via differentiable volume rendering. We constrain the learned shape by tailoring multiple regularization strategies and disentangling geometry and appearance. Experiments on public endoscope datasets demonstrate that EndoSurf significantly outperforms existing solutions, particularly in reconstructing high-fidelity shapes. Code is available at https://github.com/Ruyi-Zha/endosurf.git.

CVMar 14, 2022
Implicit Motion Handling for Video Camouflaged Object Detection

Xuelian Cheng, Huan Xiong, Deng-Ping Fan et al. · ibm-research

We propose a new video camouflaged object detection (VCOD) framework that can exploit both short-term dynamics and long-term temporal consistency to detect camouflaged objects from video frames. An essential property of camouflaged objects is that they usually exhibit patterns similar to the background and thus make them hard to identify from still images. Therefore, effectively handling temporal dynamics in videos becomes the key for the VCOD task as the camouflaged objects will be noticeable when they move. However, current VCOD methods often leverage homography or optical flows to represent motions, where the detection error may accumulate from both the motion estimation error and the segmentation error. On the other hand, our method unifies motion estimation and object segmentation within a single optimization framework. Specifically, we build a dense correlation volume to implicitly capture motions between neighbouring frames and utilize the final segmentation supervision to optimize the implicit motion estimation and segmentation jointly. Furthermore, to enforce temporal consistency within a video sequence, we jointly utilize a spatio-temporal transformer to refine the short-term predictions. Extensive experiments on VCOD benchmarks demonstrate the architectural effectiveness of our approach. We also provide a large-scale VCOD dataset named MoCA-Mask with pixel-level handcrafted ground-truth masks and construct a comprehensive VCOD benchmark with previous methods to facilitate research in this direction. Dataset Link: https://xueliancheng.github.io/SLT-Net-project.

IVSep 16, 2022Code
Hybrid Window Attention Based Transformer Architecture for Brain Tumor Segmentation

Himashi Peiris, Munawar Hayat, Zhaolin Chen et al.

As intensities of MRI volumes are inconsistent across institutes, it is essential to extract universal features of multi-modal MRIs to precisely segment brain tumors. In this concept, we propose a volumetric vision transformer that follows two windowing strategies in attention for extracting fine features and local distributional smoothness (LDS) during model training inspired by virtual adversarial training (VAT) to make the model robust. We trained and evaluated network architecture on the FeTS Challenge 2022 dataset. Our performance on the online validation dataset is as follows: Dice Similarity Score of 81.71%, 91.38% and 85.40%; Hausdorff Distance (95%) of 14.81 mm, 3.93 mm, 11.18 mm for the enhancing tumor, whole tumor, and tumor core, respectively. Overall, the experimental results verify our method's effectiveness by yielding better performance in segmentation accuracy for each tumor sub-region. Our code implementation is publicly available : https://github.com/himashi92/vizviva_fets_2022

CVSep 30, 2023Code
NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation

Minh-Tuan Tran, Trung Le, Xuan-May Le et al.

Data-Free Knowledge Distillation (DFKD) has made significant recent strides by transferring knowledge from a teacher neural network to a student neural network without accessing the original data. Nonetheless, existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs, which inherently lack meaningful information. Consequently, these models struggle to effectively map this noise to the ground-truth sample distribution, resulting in prolonging training times and low-quality outputs. In this paper, we propose a novel Noisy Layer Generation method (NAYER) which relocates the random source from the input to a noisy layer and utilizes the meaningful constant label-text embedding (LTE) as the input. LTE is generated by using the language model once, and then it is stored in memory for all subsequent training processes. The significance of LTE lies in its ability to contain substantial meaningful inter-class information, enabling the generation of high-quality samples with only a few training steps. Simultaneously, the noisy layer plays a key role in addressing the issue of diversity in sample generation by preventing the model from overemphasizing the constrained label information. By reinitializing the noisy layer in each iteration, we aim to facilitate the generation of diverse samples while still retaining the method's efficiency, thanks to the ease of learning provided by LTE. Experiments carried out on multiple datasets demonstrate that our NAYER not only outperforms the state-of-the-art methods but also achieves speeds 5 to 15 times faster than previous approaches. The code is available at https://github.com/tmtuan1307/nayer.

CVJul 31, 2023Code
Subspace Distillation for Continual Learning

Kaushik Roy, Christian Simon, Peyman Moghadam et al.

An ultimate objective in continual learning is to preserve knowledge learned in preceding tasks while learning new tasks. To mitigate forgetting prior knowledge, we propose a novel knowledge distillation technique that takes into the account the manifold structure of the latent/output space of a neural network in learning novel tasks. To achieve this, we propose to approximate the data manifold up-to its first order, hence benefiting from linear subspaces to model the structure and maintain the knowledge of a neural network while learning novel concepts. We demonstrate that the modeling with subspaces provides several intriguing properties, including robustness to noise and therefore effective for mitigating Catastrophic Forgetting in continual learning. We also discuss and show how our proposed method can be adopted to address both classification and segmentation problems. Empirically, we observe that our proposed method outperforms various continual learning methods on several challenging datasets including Pascal VOC, and Tiny-Imagenet. Furthermore, we show how the proposed method can be seamlessly combined with existing learning approaches to improve their performances. The codes of this article will be available at https://github.com/csiro-robotics/SDCL.

CVDec 5, 2022Code
PointCaM: Cut-and-Mix for Open-Set Point Cloud Learning

Jie Hong, Shi Qiu, Weihao Li et al.

Point cloud learning is receiving increasing attention, however, most existing point cloud models lack the practical ability to deal with the unavoidable presence of unknown objects. This paper mainly discusses point cloud learning under open-set settings, where we train the model without data from unknown classes and identify them in the inference stage. Basically, we propose to solve open-set point cloud learning using a novel Point Cut-and-Mix mechanism consisting of Unknown-Point Simulator and Unknown-Point Estimator modules. Specifically, we use the Unknown-Point Simulator to simulate out-of-distribution data in the training stage by manipulating the geometric context of partial known data. Based on this, the Unknown-Point Estimator module learns to exploit the point cloud's feature context for discriminating the known and unknown data. Extensive experiments show the plausibility of open-set point cloud learning and the effectiveness of our proposed solutions. Our code is available at \url{https://github.com/ShiQiu0419/pointcam}.

LGJul 31, 2023Code
L3DMC: Lifelong Learning using Distillation via Mixed-Curvature Space

Kaushik Roy, Peyman Moghadam, Mehrtash Harandi

The performance of a lifelong learning (L3) model degrades when it is trained on a series of tasks, as the geometrical formation of the embedding space changes while learning novel concepts sequentially. The majority of existing L3 approaches operate on a fixed-curvature (e.g., zero-curvature Euclidean) space that is not necessarily suitable for modeling the complex geometric structure of data. Furthermore, the distillation strategies apply constraints directly on low-dimensional embeddings, discouraging the L3 model from learning new concepts by making the model highly stable. To address the problem, we propose a distillation strategy named L3DMC that operates on mixed-curvature spaces to preserve the already-learned knowledge by modeling and maintaining complex geometrical structures. We propose to embed the projected low dimensional embedding of fixed-curvature spaces (Euclidean and hyperbolic) to higher-dimensional Reproducing Kernel Hilbert Space (RKHS) using a positive-definite kernel function to attain rich representation. Afterward, we optimize the L3 model by minimizing the discrepancies between the new sample representation and the subspace constructed using the old representation in RKHS. L3DMC is capable of adapting new knowledge better without forgetting old knowledge as it combines the representation power of multiple fixed-curvature spaces and is performed on higher-dimensional RKHS. Thorough experiments on three benchmarks demonstrate the effectiveness of our proposed distillation strategy for medical image classification in L3 settings. Our code implementation is publicly available at https://github.com/csiro-robotics/L3DMC.

CVAug 24, 2023
Hyperbolic Audio-visual Zero-shot Learning

Jie Hong, Zeeshan Hayder, Junlin Han et al. · oxford

Audio-visual zero-shot learning aims to classify samples consisting of a pair of corresponding audio and video sequences from classes that are not present during training. An analysis of the audio-visual data reveals a large degree of hyperbolicity, indicating the potential benefit of using a hyperbolic transformation to achieve curvature-aware geometric learning, with the aim of exploring more complex hierarchical data structures for this task. The proposed approach employs a novel loss function that incorporates cross-modality alignment between video and audio features in the hyperbolic space. Additionally, we explore the use of multiple adaptive curvatures for hyperbolic projections. The experimental results on this very challenging task demonstrate that our proposed hyperbolic approach for zero-shot learning outperforms the SOTA method on three datasets: VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL achieving a harmonic mean (HM) improvement of around 3.0%, 7.0%, and 5.3%, respectively.

CVMar 23, 2022
GOSS: Towards Generalized Open-set Semantic Segmentation

Jie Hong, Weihao Li, Junlin Han et al. · oxford

In this paper, we present and study a new image segmentation task, called Generalized Open-set Semantic Segmentation (GOSS). Previously, with the well-known open-set semantic segmentation (OSS), the intelligent agent only detects the unknown regions without further processing, limiting their perception of the environment. It stands to reason that a further analysis of the detected unknown pixels would be beneficial. Therefore, we propose GOSS, which unifies the abilities of two well-defined segmentation tasks, OSS and generic segmentation (GS), in a holistic way. Specifically, GOSS classifies pixels as belonging to known classes, and clusters (or groups) of pixels of unknown class are labelled as such. To evaluate this new expanded task, we further propose a metric which balances the pixel classification and clustering aspects. Moreover, we build benchmark tests on top of existing datasets and propose a simple neural architecture as a baseline, which jointly predicts pixel classification and clustering under open-set settings. Our experiments on multiple benchmarks demonstrate the effectiveness of our baseline. We believe our new GOSS task can produce an expressive image understanding for future research. Code will be made available.

CVJul 25, 2022
Deep Laparoscopic Stereo Matching with Transformers

Xuelian Cheng, Yiran Zhong, Mehrtash Harandi et al. · ibm-research

The self-attention mechanism, successfully employed with the transformer structure is shown promise in many computer vision tasks including image recognition, and object detection. Despite the surge, the use of the transformer for the problem of stereo matching remains relatively unexplored. In this paper, we comprehensively investigate the use of the transformer for the problem of stereo matching, especially for laparoscopic videos, and propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of the CNN and the transformer in a unified design. To be specific, we investigate several ways to introduce transformers to volumetric stereo matching pipelines by analyzing the loss landscape of the designs and in-domain/cross-domain accuracy. Our analysis suggests that employing transformers for feature representation learning, while using CNNs for cost aggregation will lead to faster convergence, higher accuracy and better generalization than other options. Our extensive experiments on Sceneflow, SCARED2019 and dVPN datasets demonstrate the superior performance of our HybridStereoNet.

CVAug 2, 2022
Curved Geometric Networks for Visual Anomaly Recognition

Jie Hong, Pengfei Fang, Weihao Li et al. · oxford

Learning a latent embedding to understand the underlying nature of data distribution is often formulated in Euclidean spaces with zero curvature. However, the success of the geometry constraints, posed in the embedding space, indicates that curved spaces might encode more structural information, leading to better discriminative power and hence richer representations. In this work, we investigate benefits of the curved space for analyzing anomalies or out-of-distribution objects in data. This is achieved by considering embeddings via three geometry constraints, namely, spherical geometry (with positive curvature), hyperbolic geometry (with negative curvature) or mixed geometry (with both positive and negative curvatures). Three geometric constraints can be chosen interchangeably in a unified design given the task at hand. Tailored for the embeddings in the curved space, we also formulate functions to compute the anomaly score. Two types of geometric modules (i.e., Geometric-in-One and Geometric-in-Two models) are proposed to plug in the original Euclidean classifier, and anomaly scores are computed from the curved embeddings. We evaluate the resulting designs under a diverse set of visual recognition scenarios, including image detection (multi-class OOD detection and one-class anomaly detection) and segmentation (multi-class anomaly segmentation and one-class anomaly segmentation). The empirical results show the effectiveness of our proposal through the consistent improvement over various scenarios.

CVOct 9, 2022Code
A Differentiable Distance Approximation for Fairer Image Classification

Nicholas Rosa, Tom Drummond, Mehrtash Harandi

Naively trained AI models can be heavily biased. This can be particularly problematic when the biases involve legally or morally protected attributes such as ethnic background, age or gender. Existing solutions to this problem come at the cost of extra computation, unstable adversarial optimisation or have losses on the feature space structure that are disconnected from fairness measures and only loosely generalise to fairness. In this work we propose a differentiable approximation of the variance of demographics, a metric that can be used to measure the bias, or unfairness, in an AI model. Our approximation can be optimised alongside the regular training objective which eliminates the need for any extra models during training and directly improves the fairness of the regularised models. We demonstrate that our approach improves the fairness of AI models in varied task and dataset scenarios, whilst still maintaining a high level of classification accuracy. Code is available at https://bitbucket.org/nelliottrosa/base_fairness.

CVJun 15, 2022
Rethinking Generalization in Few-Shot Classification

Markus Hiller, Rongkai Ma, Mehrtash Harandi et al.

Single image-level annotations only correctly describe an often small subset of an image's content, particularly when complex real-world scenes are depicted. While this might be acceptable in many classification scenarios, it poses a significant challenge for applications where the set of classes differs significantly between training and test time. In this paper, we take a closer look at the implications in the context of $\textit{few-shot learning}$. Splitting the input samples into patches and encoding these via the help of Vision Transformers allows us to establish semantic correspondences between local regions across images and independent of their respective class. The most informative patch embeddings for the task at hand are then determined as a function of the support set via online optimization at inference time, additionally providing visual interpretability of `$\textit{what matters most}$' in the image. We build on recent advances in unsupervised training of networks via masked image modelling to overcome the lack of fine-grained labels and learn the more general statistical structure of the data while avoiding negative image-level annotation influence, $\textit{aka}$ supervision collapse. Experimental results show the competitiveness of our approach, achieving new state-of-the-art results on four popular few-shot classification benchmarks for $5$-shot and $1$-shot scenarios.

LGMar 8, 2022
On Generalizing Beyond Domains in Cross-Domain Continual Learning

Christian Simon, Masoud Faraki, Yi-Hsuan Tsai et al.

Humans have the ability to accumulate knowledge of new tasks in varying conditions, but deep neural networks often suffer from catastrophic forgetting of previously learned knowledge after learning a new task. Many recent methods focus on preventing catastrophic forgetting under the assumption of train and test data following similar distributions. In this work, we consider a more realistic scenario of continual learning under domain shifts where the model must generalize its inference to an unseen domain. To this end, we encourage learning semantically meaningful features by equipping the classifier with class similarity metrics as learning parameters which are obtained through Mahalanobis similarity computations. Learning of the backbone representation along with these extra parameters is done seamlessly in an end-to-end manner. In addition, we propose an approach based on the exponential moving average of the parameters for better knowledge distillation. We demonstrate that, to a great extent, existing continual learning algorithms fail to handle the forgetting issue under multiple distributions, while our proposed approach learns new tasks under domain shift with accuracy boosts up to 10% on challenging datasets such as DomainNet and OfficeHome.

96.2LGMar 18Code
Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization

Haocheng Luo, Zehang Deng, Thanh-Toan Do et al.

Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dynamics in logit space. Our analysis reveals that negative-gradient updates cause residuals to expand rapidly along high-curvature directions, which underlies the squeezing effect, whereas Sharpness-Aware Minimization (SAM) can suppress this behavior through its curvature-regularization effect. Building on this insight, we investigate logits-SAM, a computationally efficient variant that perturbs only the output layer with negligible overhead. Extensive experiments on Pythia-2.8B, Mistral-7B, and Gemma-2B-IT across multiple datasets and benchmarks demonstrate that logits-SAM consistently improves the effectiveness of DPO and integrates seamlessly with other DPO variants. Code is available at https://github.com/RitianLuo/logits-sam-dpo.

IVJun 1, 2023
CL-MRI: Self-Supervised Contrastive Learning to Improve the Accuracy of Undersampled MRI Reconstruction

Mevan Ekanayake, Zhifeng Chen, Mehrtash Harandi et al.

In Magnetic Resonance Imaging (MRI), image acquisitions are often undersampled in the measurement domain to accelerate the scanning process, at the expense of image quality. However, image quality is a crucial factor that influences the accuracy of clinical diagnosis; hence, high-quality image reconstruction from undersampled measurements has been a key area of research. Recently, deep learning (DL) methods have emerged as the state-of-the-art for MRI reconstruction, typically involving deep neural networks to transform undersampled MRI images into high-quality MRI images through data-driven processes. Nevertheless, there is clear and significant room for improvement in undersampled DL MRI reconstruction to meet the high standards required for clinical diagnosis, in terms of eliminating aliasing artifacts and reducing image noise. In this paper, we introduce a self-supervised pretraining procedure using contrastive learning to improve the accuracy of undersampled DL MRI reconstruction. We use contrastive learning to transform the MRI image representations into a latent space that maximizes mutual information among different undersampled representations and optimizes the information content at the input of the downstream DL reconstruction models. Our experiments demonstrate improved reconstruction accuracy across a range of acceleration factors and datasets, both quantitatively and qualitatively. Furthermore, our extended experiments validate the proposed framework's robustness under adversarial conditions, such as measurement noise, different k-space sampling patterns, and pathological abnormalities, and also prove the transfer learning capabilities on MRI datasets with completely different anatomy. Additionally, we conducted experiments to visualize and analyze the properties of the proposed MRI contrastive learning latent space.

CVApr 8, 2023
Exploring Data Geometry for Continual Learning

Zhi Gao, Chen Xu, Feng Li et al.

Continual learning aims to efficiently learn from a non-stationary stream of data while avoiding forgetting the knowledge of old data. In many practical applications, data complies with non-Euclidean geometry. As such, the commonly used Euclidean space cannot gracefully capture non-Euclidean geometric structures of data, leading to inferior results. In this paper, we study continual learning from a novel perspective by exploring data geometry for the non-stationary stream of data. Our method dynamically expands the geometry of the underlying space to match growing geometric structures induced by new data, and prevents forgetting by keeping geometric structures of old data into account. In doing so, making use of the mixed curvature space, we propose an incremental search scheme, through which the growing geometric structures are encoded. Then, we introduce an angular-regularization loss and a neighbor-robustness loss to train the model, capable of penalizing the change of global geometric structures and local geometric structures. Experiments show that our method achieves better performance than baseline methods designed in Euclidean space.

LGSep 13, 2022
Concealing Sensitive Samples against Gradient Leakage in Federated Learning

Jing Wu, Munawar Hayat, Mingyi Zhou et al.

Federated Learning (FL) is a distributed learning paradigm that enhances users privacy by eliminating the need for clients to share raw, private data with the server. Despite the success, recent studies expose the vulnerability of FL to model inversion attacks, where adversaries reconstruct users private data via eavesdropping on the shared gradient information. We hypothesize that a key factor in the success of such attacks is the low entanglement among gradients per data within the batch during stochastic optimization. This creates a vulnerability that an adversary can exploit to reconstruct the sensitive data. Building upon this insight, we present a simple, yet effective defense strategy that obfuscates the gradients of the sensitive data with concealed samples. To achieve this, we propose synthesizing concealed samples to mimic the sensitive data at the gradient level while ensuring their visual dissimilarity from the actual sensitive data. Compared to the previous art, our empirical evaluations suggest that the proposed technique provides the strongest protection while simultaneously maintaining the FL performance.

CVOct 5, 2023
Continual Test-time Domain Adaptation via Dynamic Sample Selection

Yanshuo Wang, Jie Hong, Ali Cheraghian et al.

The objective of Continual Test-time Domain Adaptation (CTDA) is to gradually adapt a pre-trained model to a sequence of target domains without accessing the source data. This paper proposes a Dynamic Sample Selection (DSS) method for CTDA. DSS consists of dynamic thresholding, positive learning, and negative learning processes. Traditionally, models learn from unlabeled unknown environment data and equally rely on all samples' pseudo-labels to update their parameters through self-training. However, noisy predictions exist in these pseudo-labels, so all samples are not equally trustworthy. Therefore, in our method, a dynamic thresholding module is first designed to select suspected low-quality from high-quality samples. The selected low-quality samples are more likely to be wrongly predicted. Therefore, we apply joint positive and negative learning on both high- and low-quality samples to reduce the risk of using wrong information. We conduct extensive experiments that demonstrate the effectiveness of our proposed method for CTDA in the image domain, outperforming the state-of-the-art results. Furthermore, our approach is also evaluated in the 3D point cloud domain, showcasing its versatility and potential for broader applicability.

LGJun 15, 2022
On Enforcing Better Conditioned Meta-Learning for Rapid Few-Shot Adaptation

Markus Hiller, Mehrtash Harandi, Tom Drummond

Inspired by the concept of preconditioning, we propose a novel method to increase adaptation speed for gradient-based meta-learning methods without incurring extra parameters. We demonstrate that recasting the optimization problem to a non-linear least-squares formulation provides a principled way to actively enforce a $\textit{well-conditioned}$ parameter space for meta-learning models based on the concepts of the condition number and local curvature. Our comprehensive evaluations show that the proposed method significantly outperforms its unconstrained counterpart especially during initial adaptation steps, while achieving comparable or better overall results on several few-shot classification tasks -- creating the possibility of dynamically choosing the number of adaptation steps at inference time.

LGFeb 12, 2023
Vector Quantized Wasserstein Auto-Encoder

Tung-Long Vuong, Trung Le, He Zhao et al.

Learning deep discrete latent presentations offers a promise of better symbolic and summarized abstractions that are more useful to subsequent downstream tasks. Inspired by the seminal Vector Quantized Variational Auto-Encoder (VQ-VAE), most of work in learning deep discrete representations has mainly focused on improving the original VQ-VAE form and none of them has studied learning deep discrete representations from the generative viewpoint. In this work, we study learning deep discrete representations from the generative viewpoint. Specifically, we endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution via minimizing a WS distance between them. We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution. Finally, we empirically evaluate our method on several well-known benchmarks, where it achieves better qualitative and quantitative performances than the other VQ-VAE variants in terms of the codebook utilization and image reconstruction/generation.

CVApr 21, 2023
Hyperbolic Geometry in Computer Vision: A Survey

Pengfei Fang, Mehrtash Harandi, Trung Le et al.

Hyperbolic geometry, a Riemannian manifold endowed with constant sectional negative curvature, has been considered an alternative embedding space in many learning scenarios, \eg, natural language processing, graph learning, \etc, as a result of its intriguing property of encoding the data's hierarchical structure (like irregular graph or tree-likeness data). Recent studies prove that such data hierarchy also exists in the visual dataset, and investigate the successful practice of hyperbolic geometry in the computer vision (CV) regime, ranging from the classical image classification to advanced model adaptation learning. This paper presents the first and most up-to-date literature review of hyperbolic spaces for CV applications. To this end, we first introduce the background of hyperbolic geometry, followed by a comprehensive investigation of algorithms, with geometric prior of hyperbolic space, in the context of visual applications. We also conclude this manuscript and identify possible future directions.

CVJul 16, 2024
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

Pengxiang Li, Zhi Gao, Bofei Zhang et al.

Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. Then, we build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as the test data, two evaluation settings, and a model to provide feedback for VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M, which shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, making more efficient user-agent interactions and underscoring the significance of the FIRE dataset.

IVJul 18, 2022
Multi-branch Cascaded Swin Transformers with Attention to k-space Sampling Pattern for Accelerated MRI Reconstruction

Mevan Ekanayake, Kamlesh Pawar, Mehrtash Harandi et al.

Global correlations are widely seen in human anatomical structures due to similarity across tissues and bones. These correlations are reflected in magnetic resonance imaging (MRI) scans as a result of close-range proton density and T1/T2 parameters. Furthermore, to achieve accelerated MRI, k-space data are undersampled which causes global aliasing artifacts. Convolutional neural network (CNN) models are widely utilized for accelerated MRI reconstruction, but those models are limited in capturing global correlations due to the intrinsic locality of the convolution operation. The self-attention-based transformer models are capable of capturing global correlations among image features, however, the current contributions of transformer models for MRI reconstruction are minute. The existing contributions mostly provide CNN-transformer hybrid solutions and rarely leverage the physics of MRI. In this paper, we propose a physics-based stand-alone (convolution free) transformer model titled, the Multi-head Cascaded Swin Transformers (McSTRA) for accelerated MRI reconstruction. McSTRA combines several interconnected MRI physics-related concepts with the transformer networks: it exploits global MR features via the shifted window self-attention mechanism; it extracts MR features belonging to different spectral components separately using a multi-head setup; it iterates between intermediate de-aliasing and k-space correction via a cascaded network with data consistency in k-space and intermediate loss computations; furthermore, we propose a novel positional embedding generation mechanism to guide self-attention utilizing the point spread function corresponding to the undersampling mask. Our model significantly outperforms state-of-the-art MRI reconstruction methods both visually and quantitatively while depicting improved resolution and removal of aliasing artifacts.

CVSep 14, 2022
Learning Deep Optimal Embeddings with Sinkhorn Divergences

Soumava Kumar Roy, Yan Han, Mehrtash Harandi et al.

Deep Metric Learning algorithms aim to learn an efficient embedding space to preserve the similarity relationships among the input data. Whilst these algorithms have achieved significant performance gains across a wide plethora of tasks, they have also failed to consider and increase comprehensive similarity constraints; thus learning a sub-optimal metric in the embedding space. Moreover, up until now; there have been few studies with respect to their performance in the presence of noisy labels. Here, we address the concern of learning a discriminative deep embedding space by designing a novel, yet effective Deep Class-wise Discrepancy Loss (DCDL) function that segregates the underlying similarity distributions (thus introducing class-wise discrepancy) of the embedding points between each and every class. Our empirical results across three standard image classification datasets and two fine-grained image recognition datasets in the presence and absence of noise clearly demonstrate the need for incorporating such class-wise similarity relationships along with traditional algorithms while learning a discriminative embedding space.

CVOct 19, 2022
LAVA: Label-efficient Visual Learning and Adaptation

Islam Nassar, Munawar Hayat, Ehsan Abbasnejad et al.

We present LAVA, a simple yet effective method for multi-domain visual transfer learning with limited data. LAVA builds on a few recent innovations to enable adapting to partially labelled datasets with class and domain shifts. First, LAVA learns self-supervised visual representations on the source dataset and ground them using class label semantics to overcome transfer collapse problems associated with supervised pretraining. Secondly, LAVA maximises the gains from unlabelled target data via a novel method which uses multi-crop augmentations to obtain highly robust pseudo-labels. By combining these ingredients, LAVA achieves a new state-of-the-art on ImageNet semi-supervised protocol, as well as on 7 out of 10 datasets in multi-domain few-shot learning on the Meta-dataset. Code and models are made available.

IVSep 2, 2024
SeCo-INR: Semantically Conditioned Implicit Neural Representations for Improved Medical Image Super-Resolution

Mevan Ekanayake, Zhifeng Chen, Gary Egan et al.

Implicit Neural Representations (INRs) have recently advanced the field of deep learning due to their ability to learn continuous representations of signals without the need for large training datasets. Although INR methods have been studied for medical image super-resolution, their adaptability to localized priors in medical images has not been extensively explored. Medical images contain rich anatomical divisions that could provide valuable local prior information to enhance the accuracy and robustness of INRs. In this work, we propose a novel framework, referred to as the Semantically Conditioned INR (SeCo-INR), that conditions an INR using local priors from a medical image, enabling accurate model fitting and interpolation capabilities to achieve super-resolution. Our framework learns a continuous representation of the semantic segmentation features of a medical image and utilizes it to derive the optimal INR for each semantic region of the image. We tested our framework using several medical imaging modalities and achieved higher quantitative scores and more realistic super-resolution outputs compared to state-of-the-art methods.

LGSep 29, 2023
Sharpness-Aware Teleportation on Riemannian Manifolds

Tuan Truong, Hoang-Phi Nguyen, Haocheng Luo et al.

Recent studies highlight the effectiveness of flat minima in enhancing generalization, with sharpness-aware minimization (SAM) achieving state-of-the-art performance. Additionally, insights into the intrinsic geometry of the loss landscape have shown promise for improving model generalization. Building on these advancements, we introduce a novel sharpness-aware, geometry-aware teleportation mechanism to further enhance robustness and generalization. The core innovation of our approach is to decompose each iteration into a teleportation step within a local orbit and a sharpness-aware step that transitions between different orbits, leveraging the Riemannian quotient manifold. Our approach is grounded in a theoretical framework that analyzes the generalization gap between population loss and worst-case empirical loss within the context of Riemannian manifolds. To demonstrate the effectiveness of our method, we evaluate and compare our algorithm on diverse vision benchmarks with various datasets and Riemannian manifolds.

CVMar 21, 2024Code
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression

Yihang Chen, Qianyi Wu, Weiyao Lin et al.

3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To address this, we make use of the relations between the unorganized anchors and the structured hash grid, leveraging their mutual information for context modeling, and propose a Hash-grid Assisted Context (HAC) framework for highly compact 3DGS representation. Our approach introduces a binary hash grid to establish continuous spatial consistencies, allowing us to unveil the inherent spatial relations of anchors through a carefully designed context model. To facilitate entropy coding, we utilize Gaussian distributions to accurately estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Additionally, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Importantly, our work is the pioneer to explore context-based compression for 3DGS representation, resulting in a remarkable size reduction of over $75\times$ compared to vanilla 3DGS, while simultaneously improving fidelity, and achieving over $11\times$ size reduction over SOTA 3DGS compression approach Scaffold-GS. Our code is available here: https://github.com/YihangChen-ee/HAC

CVNov 22, 2022
Multimorbidity Content-Based Medical Image Retrieval Using Proxies

Yunyan Xing, Benjamin J. Meyer, Mehrtash Harandi et al.

Content-based medical image retrieval is an important diagnostic tool that improves the explainability of computer-aided diagnosis systems and provides decision making support to healthcare professionals. Medical imaging data, such as radiology images, are often multimorbidity; a single sample may have more than one pathology present. As such, image retrieval systems for the medical domain must be designed for the multi-label scenario. In this paper, we propose a novel multi-label metric learning method that can be used for both classification and content-based image retrieval. In this way, our model is able to support diagnosis by predicting the presence of diseases and provide evidence for these predictions by returning samples with similar pathological content to the user. In practice, the retrieved images may also be accompanied by pathology reports, further assisting in the diagnostic process. Our method leverages proxy feature vectors, enabling the efficient learning of a robust feature space in which the distance between feature vectors can be used as a measure of the similarity of those samples. Unlike existing proxy-based methods, training samples are able to assign to multiple proxies that span multiple class labels. This multi-label proxy assignment results in a feature space that encodes the complex relationships between diseases present in medical imaging data. Our method outperforms state-of-the-art image retrieval systems and a set of baseline approaches. We demonstrate the efficacy of our approach to both classification and content-based image retrieval on two multimorbidity radiology datasets.

96.3CVMar 29
Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling

Minh-Tuan Tran, Xuan-May Le, Quan Hung Tran et al.

Existing generative models, such as diffusion and auto-regressive networks, are inherently static, relying on a fixed set of pretrained parameters to handle all inputs. In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. Composer generates input-conditioned parameter adaptations at inference time, which are injected into the pretrained model's weights, enabling per-input specialization without fine-tuning or retraining. Adaptation occurs once prior to multi-step generation, yielding higher-quality, context-aware outputs with minimal computational and memory overhead. Experiments show that Composer substantially improves performance across diverse generative models and use cases, including lightweight/quantized models and test-time scaling. By leveraging input-aware parameter composition, Composer establishes a new paradigm for designing generative models that dynamically adapt to each input, moving beyond static parameterization.

CVOct 31, 2025
Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Wu Wei, Xiaomeng Fan, Yuwei Wu et al.

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.

LGJan 11, 2024Code
Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks

Jing Wu, Mehrtash Harandi

Machine unlearning has become a pivotal task to erase the influence of data from a trained model. It adheres to recent data regulation standards and enhances the privacy and security of machine learning applications. In this work, we present a new machine unlearning approach Scissorhands. Initially, Scissorhands identifies the most pertinent parameters in the given model relative to the forgetting data via connection sensitivity. By reinitializing the most influential top-k percent of these parameters, a trimmed model for erasing the influence of the forgetting data is obtained. Subsequently, Scissorhands fine-tunes the trimmed model with a gradient projection-based approach, seeking parameters that preserve information on the remaining data while discarding information related to the forgetting data. Our experimental results, conducted across image classification and image generation tasks, demonstrate that Scissorhands, showcases competitive performance when compared to existing methods. Source code is available at https://github.com/JingWu321/Scissorhands.

CVMar 21, 2024Code
Text-Enhanced Data-free Approach for Federated Class-Incremental Learning

Minh-Tuan Tran, Trung Le, Xuan-May Le et al.

Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal issue, involving the dynamic addition of new classes in the context of federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing catastrophic forgetting and data privacy problems. However, prior approaches lack the crucial synergy between DFKT and the model training phases, causing DFKT to encounter difficulties in generating high-quality data from a non-anchored latent space of the old task model. In this paper, we introduce LANDER (Label Text Centered Data-Free Knowledge Transfer) to address this issue by utilizing label text embeddings (LTE) produced by pretrained language models. Specifically, during the model training phase, our approach treats LTE as anchor points and constrains the feature embeddings of corresponding training samples around them, enriching the surrounding area with more meaningful information. In the DFKT phase, by using these LTE anchors, LANDER can synthesize more meaningful samples, thereby effectively addressing the forgetting problem. Additionally, instead of tightly constraining embeddings toward the anchor, the Bounding Loss is introduced to encourage sample embeddings to remain flexible within a defined radius. This approach preserves the natural differences in sample embeddings and mitigates the embedding overlap caused by heterogeneous federated settings. Extensive experiments conducted on CIFAR100, Tiny-ImageNet, and ImageNet demonstrate that LANDER significantly outperforms previous methods and achieves state-of-the-art performance in FCIL. The code is available at https://github.com/tmtuan1307/lander.

CVJan 21, 2025Code
HAC++: Towards 100X Compression of 3D Gaussian Splatting

Yihang Chen, Qianyi Wu, Weiyao Lin et al.

3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To achieve a compact size, we propose HAC++, which leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100X compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20X size reduction compared to Scaffold-GS. Our code is available at https://github.com/YihangChen-ee/HAC-plus.

LGJan 22, 2025Code
Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization

Haocheng Luo, Tuan Truong, Tung Pham et al.

Sharpness-Aware Minimization (SAM) has attracted significant attention for its effectiveness in improving generalization across various tasks. However, its underlying principles remain poorly understood. In this work, we analyze SAM's training dynamics using the maximum eigenvalue of the Hessian as a measure of sharpness, and propose a third-order stochastic differential equation (SDE), which reveals that the dynamics are driven by a complex mixture of second- and third-order terms. We show that alignment between the perturbation vector and the top eigenvector is crucial for SAM's effectiveness in regularizing sharpness, but find that this alignment is often inadequate in practice, limiting SAM's efficiency. Building on these insights, we introduce Eigen-SAM, an algorithm that explicitly aims to regularize the top Hessian eigenvalue by aligning the perturbation vector with the leading eigenvector. We validate the effectiveness of our theory and the practical advantages of our proposed approach through comprehensive experiments. Code is available at https://github.com/RitianLuo/EigenSAM.

CVMar 27, 2024Code
Backpropagation-free Network for 3D Test-time Adaptation

Yanshuo Wang, Ali Cheraghian, Zeeshan Hayder et al.

Real-world systems often encounter new data over time, which leads to experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods tend to apply computationally heavy and memory-intensive backpropagation-based approaches to handle this. Here, we propose a novel method that uses a backpropagation-free approach for TTA for the specific case of 3D data. Our model uses a two-stream architecture to maintain knowledge about the source domain as well as complementary target-domain-specific information. The backpropagation-free property of our model helps address the well-known forgetting problem and mitigates the error accumulation issue. The proposed method also eliminates the need for the usually noisy process of pseudo-labeling and reliance on costly self-supervised training. Moreover, our method leverages subspace learning, effectively reducing the distribution variance between the two domains. Furthermore, the source-domain-specific and the target-domain-specific streams are aligned using a novel entropy-based adaptive fusion strategy. Extensive experiments on popular benchmarks demonstrate the effectiveness of our method. The code will be available at \url{https://github.com/abie-e/BFTT3D}.

CVMar 11, 2025Code
PCGS: Progressive Compression of 3D Gaussian Splatting

Yihang Chen, Mengyao Li, Qianyi Wu et al.

3D Gaussian Splatting (3DGS) achieves impressive rendering fidelity and speed for novel view synthesis. However, its substantial data size poses a significant challenge for practical applications. While many compression techniques have been proposed, they fail to efficiently utilize existing bitstreams in on-demand applications due to their lack of progressivity, leading to a waste of resource. To address this issue, we propose PCGS (Progressive Compression of 3D Gaussian Splatting), which adaptively controls both the quantity and quality of Gaussians (or anchors) to enable effective progressivity for on-demand applications. Specifically, for quantity, we introduce a progressive masking strategy that incrementally incorporates new anchors while refining existing ones to enhance fidelity. For quality, we propose a progressive quantization approach that gradually reduces quantization step sizes to achieve finer modeling of Gaussian attributes. Furthermore, to compact the incremental bitstreams, we leverage existing quantization results to refine probability prediction, improving entropy coding efficiency across progressive levels. Overall, PCGS achieves progressivity while maintaining compression performance comparable to SoTA non-progressive methods. Code available at: github.com/YihangChen-ee/PCGS.

CVNov 26, 2024Code
Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation

Minh-Tuan Tran, Trung Le, Xuan-May Le et al.

Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data. While DFKD methods have achieved success on smaller datasets like CIFAR10 and CIFAR100, they encounter challenges on larger, high-resolution datasets such as ImageNet. A primary issue with previous approaches is their generation of synthetic images at high resolutions (e.g., $224 \times 224$) without leveraging information from real images, often resulting in noisy images that lack essential class-specific features in large datasets. Additionally, the computational cost of generating the extensive data needed for effective knowledge transfer can be prohibitive. In this paper, we introduce MUlti-reSolution data-freE (MUSE) to address these limitations. MUSE generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features. To further enhance model diversity, we propose multi-resolution generation and embedding diversity techniques that strengthen latent space representations, leading to significant performance improvements. Experimental results demonstrate that MUSE achieves state-of-the-art performance across both small- and large-scale datasets, with notable performance gains of up to two digits in nearly all ImageNet and subset experiments. Code is available at https://github.com/tmtuan1307/muse.

CVJun 6, 2024Code
How Far Can We Compress Instant-NGP-Based NeRF?

Yihang Chen, Qianyi Wu, Mehrtash Harandi et al.

In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with implicit NeRF representation, which however results in a large storage space requirement. In this paper, we introduce the Context-based NeRF Compression (CNC) framework, which leverages highly efficient context models to provide a storage-friendly NeRF representation. Specifically, we excavate both level-wise and dimension-wise context dependencies to enable probability prediction for information entropy reduction. Additionally, we exploit hash collision and occupancy grids as strong prior knowledge for better context modeling. To the best of our knowledge, we are the first to construct and exploit context models for NeRF compression. We achieve a size reduction of 100$\times$ and 70$\times$ with improved fidelity against the baseline Instant-NGP on Synthesic-NeRF and Tanks and Temples datasets, respectively. Additionally, we attain 86.7\% and 82.3\% storage size reduction against the SOTA NeRF compression method BiRF. Our code is available here: https://github.com/YihangChen-ee/CNC.

IVJan 11, 2022Code
Reciprocal Adversarial Learning for Brain Tumor Segmentation: A Solution to BraTS Challenge 2021 Segmentation Task

Himashi Peiris, Zhaolin Chen, Gary Egan et al.

This paper proposes an adversarial learning based training approach for brain tumor segmentation task. In this concept, the 3D segmentation network learns from dual reciprocal adversarial learning approaches. To enhance the generalization across the segmentation predictions and to make the segmentation network robust, we adhere to the Virtual Adversarial Training approach by generating more adversarial examples via adding some noise on original patient data. By incorporating a critic that acts as a quantitative subjective referee, the segmentation network learns from the uncertainty information associated with segmentation results. We trained and evaluated network architecture on the RSNA-ASNR-MICCAI BraTS 2021 dataset. Our performance on the online validation dataset is as follows: Dice Similarity Score of 81.38%, 90.77% and 85.39%; Hausdorff Distance (95\%) of 21.83 mm, 5.37 mm, 8.56 mm for the enhancing tumor, whole tumor and tumor core, respectively. Similarly, our approach achieved a Dice Similarity Score of 84.55%, 90.46% and 85.30%, as well as Hausdorff Distance (95\%) of 13.48 mm, 6.32 mm and 16.98 mm on the final test dataset. Overall, our proposed approach yielded better performance in segmentation accuracy for each tumor sub-region. Our code implementation is publicly available at https://github.com/himashi92/vizviva_brats_2021

IVNov 26, 2021Code
A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation

Himashi Peiris, Munawar Hayat, Zhaolin Chen et al.

We propose a Transformer architecture for volumetric segmentation, a challenging task that requires keeping a complex balance in encoding local and global spatial cues, and preserving information along all axes of the volume. Encoder of the proposed design benefits from self-attention mechanism to simultaneously encode local and global cues, while the decoder employs a parallel self and cross attention formulation to capture fine details for boundary refinement. Empirically, we show that the proposed design choices result in a computationally efficient model, with competitive and promising results on the Medical Segmentation Decathlon (MSD) brain tumor segmentation (BraTS) Task. We further show that the representations learned by our model are robust against data corruptions. \href{https://github.com/himashi92/VT-UNet}{Our code implementation is publicly available}.

LGOct 13, 2021Code
Dense Uncertainty Estimation

Jing Zhang, Yuchao Dai, Mochu Xiang et al.

Deep neural networks can be roughly divided into deterministic neural networks and stochastic neural networks.The former is usually trained to achieve a mapping from input space to output space via maximum likelihood estimation for the weights, which leads to deterministic predictions during testing. In this way, a specific weights set is estimated while ignoring any uncertainty that may occur in the proper weight space. The latter introduces randomness into the framework, either by assuming a prior distribution over model parameters (i.e. Bayesian Neural Networks) or including latent variables (i.e. generative models) to explore the contribution of latent variables for model predictions, leading to stochastic predictions during testing. Different from the former that achieves point estimation, the latter aims to estimate the prediction distribution, making it possible to estimate uncertainty, representing model ignorance about its predictions. We claim that conventional deterministic neural network based dense prediction tasks are prone to overfitting, leading to over-confident predictions, which is undesirable for decision making. In this paper, we investigate stochastic neural networks and uncertainty estimation techniques to achieve both accurate deterministic prediction and reliable uncertainty estimation. Specifically, we work on two types of uncertainty estimations solutions, namely ensemble based methods and generative model based methods, and explain their pros and cons while using them in fully/semi/weakly-supervised framework. Due to the close connection between uncertainty estimation and model calibration, we also introduce how uncertainty estimation can be used for deep model calibration to achieve well-calibrated models, namely dense model calibration. Code and data are available at https://github.com/JingZhang617/UncertaintyEstimation.

CVAug 25, 2021Code
Duo-SegNet: Adversarial Dual-Views for Semi-Supervised Medical Image Segmentation

Himashi Peiris, Zhaolin Chen, Gary Egan et al.

Segmentation of images is a long-standing challenge in medical AI. This is mainly due to the fact that training a neural network to perform image segmentation requires a significant number of pixel-level annotated data, which is often unavailable. To address this issue, we propose a semi-supervised image segmentation technique based on the concept of multi-view learning. In contrast to the previous art, we introduce an adversarial form of dual-view training and employ a critic to formulate the learning problem in multi-view training as a min-max problem. Thorough quantitative and qualitative evaluations on several datasets indicate that our proposed method outperforms state-of-the-art medical image segmentation algorithms consistently and comfortably. The code is publicly available at https://github.com/himashi92/Duo-SegNet

CVOct 26, 2020Code
Hierarchical Neural Architecture Search for Deep Stereo Matching

Xuelian Cheng, Yiran Zhong, Mehrtash Harandi et al.

To reduce the human efforts in neural network design, Neural Architecture Search (NAS) has been applied with remarkable success to various high-level vision tasks such as classification and semantic segmentation. The underlying idea for the NAS algorithm is straightforward, namely, to enable the network the ability to choose among a set of operations (e.g., convolution with different filter sizes), one is able to find an optimal architecture that is better adapted to the problem at hand. However, so far the success of NAS has not been enjoyed by low-level geometric vision tasks such as stereo matching. This is partly due to the fact that state-of-the-art deep stereo matching networks, designed by humans, are already sheer in size. Directly applying the NAS to such massive structures is computationally prohibitive based on the currently available mainstream computing resources. In this paper, we propose the first end-to-end hierarchical NAS framework for deep stereo matching by incorporating task-specific human knowledge into the neural architecture search framework. Specifically, following the gold standard pipeline for deep stereo matching (i.e., feature extraction -- feature volume construction and dense matching), we optimize the architectures of the entire pipeline jointly. Extensive experiments show that our searched network outperforms all state-of-the-art deep stereo matching architectures and is ranked at the top 1 accuracy on KITTI stereo 2012, 2015 and Middlebury benchmarks, as well as the top 1 on SceneFlow dataset with a substantial improvement on the size of the network and the speed of inference. The code is available at https://github.com/XuelianCheng/LEAStereo.

CVJan 11, 2024
Erasing Undesirable Influence in Diffusion Models

Jing Wu, Trung Le, Munawar Hayat et al.

Diffusion models are highly effective at generating high-quality images but pose risks, such as the unintentional generation of NSFW (not safe for work) content. Although various techniques have been proposed to mitigate unwanted influences in diffusion models while preserving overall performance, achieving a balance between these goals remains challenging. In this work, we introduce EraseDiff, an algorithm designed to preserve the utility of the diffusion model on retained data while removing the unwanted information associated with the data to be forgotten. Our approach formulates this task as a constrained optimization problem using the value function, resulting in a natural first-order algorithm for solving the optimization problem. By altering the generative process to deviate away from the ground-truth denoising trajectory, we update parameters for preservation while controlling constraint reduction to ensure effective erasure, striking an optimal trade-off. Extensive experiments and thorough comparisons with state-of-the-art algorithms demonstrate that EraseDiff effectively preserves the model's utility, efficacy, and efficiency.

CVNov 23, 2024
MUNBa: Machine Unlearning via Nash Bargaining

Jing Wu, Mehrtash Harandi

Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.

CVMay 24, 2025
Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Chaofan Gan, Yuanpeng Tu, Xi Chen et al.

Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as \textit{massive activations}, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We analyze these dimension-concentrated massive activations and uncover that their concentration is inherently linked to the Adaptive Layer Normalization (AdaLN) in DiTs. Building on these findings, we propose the \textbf{Di}ffusion \textbf{T}ransformer \textbf{F}eature (DiTF), a training-free AdaLN-based framework that extracts semantically discriminative features from DiTs. Specifically, DiTF leverages AdaLN to adaptively localize and normalize massive activations through channel-wise modulation. Furthermore, a channel discard strategy is introduced to mitigate the adverse effects of massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (\eg, with +9.4\% on Spair-71k and +4.4\% on AP-10K-C.S.).

LGAug 24, 2025
Curvature Learning for Generalization of Hyperbolic Neural Networks

Xiaomeng Fan, Yuwei Wu, Zhi Gao et al.

Hyperbolic neural networks (HNNs) have demonstrated notable efficacy in representing real-world data with hierarchical structures via exploiting the geometric properties of hyperbolic spaces characterized by negative curvatures. Curvature plays a crucial role in optimizing HNNs. Inappropriate curvatures may cause HNNs to converge to suboptimal parameters, degrading overall performance. So far, the theoretical foundation of the effect of curvatures on HNNs has not been developed. In this paper, we derive a PAC-Bayesian generalization bound of HNNs, highlighting the role of curvatures in the generalization of HNNs via their effect on the smoothness of the loss landscape. Driven by the derived bound, we propose a sharpness-aware curvature learning method to smooth the loss landscape, thereby improving the generalization of HNNs. In our method, we design a scope sharpness measure for curvatures, which is minimized through a bi-level optimization process. Then, we introduce an implicit differentiation algorithm that efficiently solves the bi-level optimization by approximating gradients of curvatures. We present the approximation error and convergence analyses of the proposed method, showing that the approximation error is upper-bounded, and the proposed method can converge by bounding gradients of HNNs. Experiments on four settings: classification, learning from long-tailed data, learning from noisy data, and few-shot learning show that our method can improve the performance of HNNs.

CVJun 23, 2025
Geometry-aware Distance Measure for Diverse Hierarchical Structures in Hyperbolic Spaces

Pengxiang Li, Yuwei Wu, Zhi Gao et al.

Learning in hyperbolic spaces has attracted increasing attention due to its superior ability to model hierarchical structures of data. Most existing hyperbolic learning methods use fixed distance measures for all data, assuming a uniform hierarchy across all data points. However, real-world hierarchical structures exhibit significant diversity, making this assumption overly restrictive. In this paper, we propose a geometry-aware distance measure in hyperbolic spaces, which dynamically adapts to varying hierarchical structures. Our approach derives the distance measure by generating tailored projections and curvatures for each pair of data points, effectively mapping them to an appropriate hyperbolic space. We introduce a revised low-rank decomposition scheme and a hard-pair mining mechanism to mitigate the computational cost of pair-wise distance computation without compromising accuracy. We present an upper bound on the low-rank approximation error using Talagrand's concentration inequality, ensuring theoretical robustness. Extensive experiments on standard image classification (MNIST, CIFAR-10 and CIFAR-100), hierarchical classification (5-level CIFAR-100), and few-shot learning tasks (mini-ImageNet, tiered-ImageNet) demonstrate the effectiveness of our method. Our approach consistently outperforms learning methods that use fixed distance measures, with notable improvements on few-shot learning tasks, where it achieves over 5\% gains on mini-ImageNet. The results reveal that adaptive distance measures better capture diverse hierarchical structures, with visualization showing clearer class boundaries and improved prototype separation in hyperbolic spaces.