Hu Wang

CV
h-index47
64papers
1,351citations
Novelty48%
AI Score60

64 Papers

CVJul 26, 2023Code
Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling

Hu Wang, Yuanhong Chen, Congbo Ma et al.

The missing modality issue is critical but non-trivial to be solved by multi-modal models. Current methods aiming to handle the missing modality problem in multi-modal tasks, either deal with missing modalities only during evaluation or train separate models to handle specific missing modality settings. In addition, these models are designed for specific tasks, so for example, classification models are not easily adapted to segmentation tasks and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method that is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved from a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. Also, the design simplicity of ShaSpec enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour. The code repository address is https://github.com/billhhh/ShaSpec/.

LGSep 15, 2022Code
M^4I: Multi-modal Models Membership Inference

Pingyi Hu, Zihan Wang, Ruoxi Sun et al.

With the development of machine learning techniques, the attention of research has been moved from single-modal learning to multi-modal learning, as real-world data exist in the form of different modalities. However, multi-modal models often carry more information than single-modal models and they are usually applied in sensitive scenarios, such as medical report generation or disease identification. Compared with the existing membership inference against machine learning classifiers, we focus on the problem that the input and output of the multi-modal models are in different modalities, such as image captioning. This work studies the privacy leakage of multi-modal models through the lens of membership inference attack, a process of determining whether a data record involves in the model training process or not. To achieve this, we propose Multi-modal Models Membership Inference (M^4I) with two attack methods to infer the membership status, named metric-based (MB) M^4I and feature-based (FB) M^4I, respectively. More specifically, MB M^4I adopts similarity metrics while attacking to infer target data membership. FB M^4I uses a pre-trained shadow multi-modal feature extractor to achieve the purpose of data inference attack by comparing the similarities from extracted input and output features. Extensive experimental results show that both attack methods can achieve strong performances. Respectively, 72.5% and 94.83% of attack success rates on average can be obtained under unrestricted scenarios. Moreover, we evaluate multiple defense mechanisms against our attacks. The source code of M^4I attacks is publicly available at https://github.com/MultimodalMI/Multimodal-membership-inference.git.

85.1CVMay 29Code
Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning

Hao Zheng, Hu Wang, Tiantian Zheng et al.

Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12x larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation.

IVJul 5, 2023
Distilling Missing Modality Knowledge from Ultrasound for Endometriosis Diagnosis with Magnetic Resonance Images

Yuan Zhang, Hu Wang, David Butler et al.

Endometriosis is a common chronic gynecological disorder that has many characteristics, including the pouch of Douglas (POD) obliteration, which can be diagnosed using Transvaginal gynecological ultrasound (TVUS) scans and magnetic resonance imaging (MRI). TVUS and MRI are complementary non-invasive endometriosis diagnosis imaging techniques, but patients are usually not scanned using both modalities and, it is generally more challenging to detect POD obliteration from MRI than TVUS. To mitigate this classification imbalance, we propose in this paper a knowledge distillation training algorithm to improve the POD obliteration detection from MRI by leveraging the detection results from unpaired TVUS data. More specifically, our algorithm pre-trains a teacher model to detect POD obliteration from TVUS data, and it also pre-trains a student model with 3D masked auto-encoder using a large amount of unlabelled pelvic 3D MRI volumes. Next, we distill the knowledge from the teacher TVUS POD obliteration detector to train the student MRI model by minimizing a regression loss that approximates the output of the student to the teacher using unpaired TVUS and MRI data. Experimental results on our endometriosis dataset containing TVUS and MRI data demonstrate the effectiveness of our method to improve the POD detection accuracy from MRI.

CVApr 6, 2023
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

Yuanhong Chen, Yuyuan Liu, Hu Wang et al.

Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files, and 2) a model that can establish strong links between audio information and its corresponding visual object. However, these requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore, experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.

CVSep 21, 2022
Multi-view Local Co-occurrence and Global Consistency Learning Improve Mammogram Classification Generalisation

Yuanhong Chen, Hu Wang, Chong Wang et al.

When analysing screening mammograms, radiologists can naturally process information across two ipsilateral views of each breast, namely the cranio-caudal (CC) and mediolateral-oblique (MLO) views. These multiple related images provide complementary diagnostic information and can improve the radiologist's classification accuracy. Unfortunately, most existing deep learning systems, trained with globally-labelled images, lack the ability to jointly analyse and integrate global and local information from these multiple views. By ignoring the potentially valuable information present in multiple images of a screening episode, one limits the potential accuracy of these systems. Here, we propose a new multi-view global-local analysis method that mimics the radiologist's reading procedure, based on a global consistency learning and local co-occurrence learning of ipsilateral views in mammograms. Extensive experiments show that our model outperforms competing methods, in terms of classification accuracy and generalisation, on a large-scale private dataset and two publicly available datasets, where models are exclusively trained and tested with global labels.

CVJul 22, 2022
Uncertainty-aware Multi-modal Learning via Cross-modal Random Network Prediction

Hu Wang, Jianpeng Zhang, Yuanhong Chen et al.

Multi-modal learning focuses on training models by equally combining multiple input data modalities during the prediction process. However, this equal combination can be detrimental to the prediction accuracy because different modalities are usually accompanied by varying levels of uncertainty. Using such uncertainty to combine modalities has been studied by a couple of approaches, but with limited success because these approaches are either designed to deal with specific classification or segmentation problems and cannot be easily translated into other tasks, or suffer from numerical instabilities. In this paper, we propose a new Uncertainty-aware Multi-modal Learner that estimates uncertainty by measuring feature density via Cross-modal Random Network Prediction (CRNP). CRNP is designed to require little adaptation to translate between different prediction tasks, while having a stable training process. From a technical point of view, CRNP is the first approach to explore random network prediction to estimate uncertainty and to combine multi-modal data. Experiments on two 3D multi-modal medical image segmentation tasks and three 2D multi-modal computer vision classification tasks show the effectiveness, adaptability and robustness of CRNP. Also, we provide an extensive discussion on different fusion functions and visualization to validate the proposed model.

IVMar 3, 2022
BoMD: Bag of Multi-label Descriptors for Noisy Chest X-ray Classification

Yuanhong Chen, Fengbei Liu, Hu Wang et al.

Deep learning methods have shown outstanding classification accuracy in medical imaging problems, which is largely attributed to the availability of large-scale datasets manually annotated with clean labels. However, given the high cost of such manual annotation, new medical imaging classification problems may need to rely on machine-generated noisy labels extracted from radiology reports. Indeed, many Chest X-ray (CXR) classifiers have already been modelled from datasets with noisy labels, but their training procedure is in general not robust to noisy-label samples, leading to sub-optimal models. Furthermore, CXR datasets are mostly multi-label, so current noisy-label learning methods designed for multi-class problems cannot be easily adapted. In this paper, we propose a new method designed for the noisy multi-label CXR learning, which detects and smoothly re-labels samples from the dataset, which is then used to train common multi-label classifiers. The proposed method optimises a bag of multi-label descriptors (BoMD) to promote their similarity with the semantic descriptors produced by BERT models from the multi-label image annotation. Our experiments on diverse noisy multi-label training sets and clean testing sets show that our model has state-of-the-art accuracy and robustness in many CXR multi-label classification benchmarks.

CVJul 9, 2024Code
ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation

Yuyuan Liu, Yuanhong Chen, Hu Wang et al.

The costly and time-consuming annotation process to produce large training sets for modelling semantic LiDAR segmentation methods has motivated the development of semi-supervised learning (SSL) methods. However, such SSL approaches often concentrate on employing consistency learning only for individual LiDAR representations. This narrow focus results in limited perturbations that generally fail to enable effective consistency learning. Additionally, these SSL approaches employ contrastive learning based on the sampling from a limited set of positive and negative embedding samples. This paper introduces a novel semi-supervised LiDAR semantic segmentation framework called ItTakesTwo (IT2). IT2 is designed to ensure consistent predictions from peer LiDAR representations, thereby improving the perturbation effectiveness in consistency learning. Furthermore, our contrastive learning employs informative samples drawn from a distribution of positive and negative embeddings learned from the entire training set. Results on public benchmarks show that our approach achieves remarkable improvements over the previous state-of-the-art (SOTA) methods in the field. The code is available at: https://github.com/yyliu01/IT2.

CVOct 2, 2023
Learnable Cross-modal Knowledge Distillation for Multi-modal Learning with Missing Modality

Hu Wang, Congbo Ma, Jianpeng Zhang et al.

The problem of missing modalities is both critical and non-trivial to be handled in multi-modal models. It is common for multi-modal tasks that certain modalities contribute more compared to other modalities, and if those important modalities are missing, the model performance drops significantly. Such fact remains unexplored by current multi-modal approaches that recover the representation from missing modalities by feature reconstruction or blind feature aggregation from other modalities, instead of extracting useful information from the best performing modalities. In this paper, we propose a Learnable Cross-modal Knowledge Distillation (LCKD) model to adaptively identify important modalities and distil knowledge from them to help other modalities from the cross-modal perspective for solving the missing modality issue. Our approach introduces a teacher election procedure to select the most ``qualified'' teachers based on their single modality performance on certain tasks. Then, cross-modal knowledge distillation is performed between teacher and student modalities for each task to push the model parameters to a point that is beneficial for all tasks. Hence, even if the teacher modalities for certain tasks are missing during testing, the available student modalities can accomplish the task well enough based on the learned knowledge from their automatically elected teacher modalities. Experiments on the Brain Tumour Segmentation Dataset 2018 (BraTS2018) shows that LCKD outperforms other methods by a considerable margin, improving the state-of-the-art performance by 3.61% for enhancing tumour, 5.99% for tumour core, and 3.76% for whole tumour in terms of segmentation Dice score.

CVJul 7, 2024
CPM: Class-conditional Prompting Machine for Audio-visual Segmentation

Yuanhong Chen, Chong Wang, Yuyuan Liu et al.

Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging transformer-based segmentation architecture due to its inherent ability to capture long-range dependencies and flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is upgraded with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.

NANov 14, 2017
A novel X-FEM based fast computational method for crack propagation

Zhenxing Cheng, Hu Wang

This study suggests a fast computational method for crack propagation, which is based on the extended finite element method (X-FEM). It is well known that the X-FEM might be the most popular numerical method for crack propagation. However, with the increase of complexity of the given problem, the size of FE model and the number of iterative steps are increased correspondingly. To improve the efficiency of X-FEM, an efficient computational method termed decomposed updating reanalysis (DUR) method is suggested. For most of X-FEM simulation procedures, the change of each iterative step is small and it will only lead a local change of stiffness matrix. Therefore, the DUR method is proposed to predict the modified response by only calculating the changed part of equilibrium equations. Compared with other fast computational methods, the distinctive characteristic of the proposed method is to update the modified stiffness matrix with a local updating strategy, which only the changed part of stiffness matrix needs to be updated. To verify the performance of the DUR method, several typical numerical examples have been analyzed and the results demonstrate that this method is a highly efficient method with high accuracy.

CVOct 31, 2025Code
T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis

Raza Imam, Hu Wang, Dwarikanath Mahapatra et al.

In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.

CVSep 3, 2024
Human-AI Collaborative Multi-modal Multi-rater Learning for Endometriosis Diagnosis

Hu Wang, David Butler, Yuan Zhang et al.

Endometriosis, affecting about 10% of individuals assigned female at birth, is challenging to diagnose and manage. Diagnosis typically involves the identification of various signs of the disease using either laparoscopic surgery or the analysis of T1/T2 MRI images, with the latter being quicker and cheaper but less accurate. A key diagnostic sign of endometriosis is the obliteration of the Pouch of Douglas (POD). However, even experienced clinicians struggle with accurately classifying POD obliteration from MRI images, which complicates the training of reliable AI models. In this paper, we introduce the Human-AI Collaborative Multi-modal Multi-rater Learning (HAICOMM) methodology to address the challenge above. HAICOMM is the first method that explores three important aspects of this problem: 1) multi-rater learning to extract a cleaner label from the multiple "noisy" labels available per training sample; 2) multi-modal learning to leverage the presence of T1/T2 MRI images for training and testing; and 3) human-AI collaboration to build a system that leverages the predictions from clinicians and the AI model to provide more accurate classification than standalone clinicians and AI models. Presenting results on the multi-rater T1/T2 MRI endometriosis dataset that we collected to validate our methodology, the proposed HAICOMM model outperforms an ensemble of clinicians, noisy-label learning models, and multi-rater learning methods.

CLSep 13, 2022
Document-aware Positional Encoding and Linguistic-guided Encoding for Abstractive Multi-document Summarization

Congbo Ma, Wei Emma Zhang, Pitawelayalage Dasun Dileepa Pitawela et al.

One key challenge in multi-document summarization is to capture the relations among input documents that distinguish between single document summarization (SDS) and multi-document summarization (MDS). Few existing MDS works address this issue. One effective way is to encode document positional information to assist models in capturing cross-document relations. However, existing MDS models, such as Transformer-based models, only consider token-level positional information. Moreover, these models fail to capture sentences' linguistic structure, which inevitably causes confusions in the generated summaries. Therefore, in this paper, we propose document-aware positional encoding and linguistic-guided encoding that can be fused with Transformer architecture for MDS. For document-aware positional encoding, we introduce a general protocol to guide the selection of document encoding functions. For linguistic-guided encoding, we propose to embed syntactic dependency relations into the dependency relation mask with a simple but effective non-linear encoding learner for feature learning. Extensive experiments show the proposed model can generate summaries with high quality.

CVSep 12, 2024
Deep Multimodal Learning with Missing Modality: A Survey

Renjie Wu, Hu Wang, Hsiang-Ting Chen et al.

During multimodal model training and testing, certain data modalities may be absent due to sensor limitations, cost constraints, privacy concerns, or data loss, negatively affecting performance. Multimodal learning techniques designed to handle missing modalities can mitigate this by ensuring model robustness even when some modalities are unavailable. This survey reviews recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning methods. It provides the first comprehensive survey that covers the motivation and distinctions between MLMM and standard multimodal learning setups, followed by a detailed analysis of current methods, applications, and datasets, concluding with challenges and future directions.

CVMar 18, 2024Code
MedMerge: Merging Models for Effective Transfer Learning to Medical Imaging Tasks

Ibrahim Almakky, Santosh Sanjeev, Anees Ur Rehman Hashmi et al.

Transfer learning has become a powerful tool to initialize deep learning models to achieve faster convergence and higher performance. This is especially useful in the medical imaging analysis domain, where data scarcity limits possible performance gains for deep learning models. Some advancements have been made in boosting the transfer learning performance gain by merging models starting from the same initialization. However, in the medical imaging analysis domain, there is an opportunity to merge models starting from different initializations, thus combining the features learned from different tasks. In this work, we propose MedMerge, a method whereby the weights of different models can be merged, and their features can be effectively utilized to boost performance on a new task. With MedMerge, we learn kernel-level weights that can later be used to merge the models into a single model, even when starting from different initializations. Testing on various medical imaging analysis tasks, we show that our merged model can achieve significant performance gains, with up to 7% improvement on the F1 score. The code implementation of this work is available at github.com/BioMedIA-MBZUAI/MedMerge.

CVSep 19, 2024
A Novel Perspective for Multi-modal Multi-label Skin Lesion Classification

Yuan Zhang, Yutong Xie, Hu Wang et al.

The efficacy of deep learning-based Computer-Aided Diagnosis (CAD) methods for skin diseases relies on analyzing multiple data modalities (i.e., clinical+dermoscopic images, and patient metadata) and addressing the challenges of multi-label classification. Current approaches tend to rely on limited multi-modal techniques and treat the multi-label problem as a multiple multi-class problem, overlooking issues related to imbalanced learning and multi-label correlation. This paper introduces the innovative Skin Lesion Classifier, utilizing a Multi-modal Multi-label TransFormer-based model (SkinM2Former). For multi-modal analysis, we introduce the Tri-Modal Cross-attention Transformer (TMCT) that fuses the three image and metadata modalities at various feature levels of a transformer encoder. For multi-label classification, we introduce a multi-head attention (MHA) module to learn multi-label correlations, complemented by an optimisation that handles multi-label and imbalanced learning problems. SkinM2Former achieves a mean average accuracy of 77.27% and a mean diagnostic accuracy of 77.85% on the public Derm7pt dataset, outperforming state-of-the-art (SOTA) methods.

LGMay 12, 2025Code
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

Hu Wang, Congbo Ma, Ian Reid et al.

The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. Recently, for language modeling, Group Relative Policy Optimization (GRPO) was proposed to compute the advantage for each output by subtracting the mean reward, as the baseline, for all outputs in the group. However, it can lead to high variance when the reward advantage is inaccurately predicted. In this work, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) model, by using lightweight Kalman filtering to dynamically estimate the latent reward baseline and uncertainty. This filtering technique replaces the naive group mean, enabling more adaptive advantage normalization. Our method does not require additional learned parameters over GRPO. This approach offers a simple yet effective way to incorporate multiple outputs of GRPO into advantage estimation, improving policy optimization in settings where highly dynamic reward signals are difficult to model for language models. Through the accuracies and rewards obtained from math question answering and reasoning, we show that using a more adaptive advantage estimation model, KRPO can improve the stability and performance of GRPO. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.

LGNov 14, 2024Code
Rethinking Weight-Averaged Model-merging

Hu Wang, Congbo Ma, Ibrahim Almakky et al.

Model merging, particularly through weight averaging, has shown surprising effectiveness in saving computations and improving model performance without any additional training. However, the interpretability of why and how this technique works remains unclear. In this work, we reinterpret weight-averaged model merging through the lens of interpretability and provide empirical insights into the underlying mechanisms that govern its behavior. We approach the problem from three perspectives: (1) we analyze the learned weight structures and demonstrate that model weights encode structured representations that help explain the compatibility of weight averaging; (2) we compare averaging in weight space and feature space across diverse model architectures (CNNs and ViTs) and datasets, aiming to expose under which circumstances what combination paradigm will work more effectively; (3) we study the effect of parameter scaling on prediction stability, highlighting how weight averaging acts as a form of regularization that contributes to robustness. By framing these analyses in an interpretability context, our work contributes to a more transparent and systematic understanding of model merging for stakeholders interested in the safety and reliability of untrained model combination methods. The code is available at https://github.com/billhhh/Rethink-Merge.

CVJul 10, 2024
Deformable Feature Alignment and Refinement for Moving Infrared Dim-small Target Detection

Dengyan Luo, Yanping Xiang, Hu Wang et al.

The detection of moving infrared dim-small targets has been a challenging and prevalent research topic. The current state-of-the-art methods are mainly based on ConvLSTM to aggregate information from adjacent frames to facilitate the detection of the current frame. However, these methods implicitly utilize motion information only in the training stage and fail to explicitly explore motion compensation, resulting in poor performance in the case of a video sequence including large motion. In this paper, we propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution to explicitly use motion context in both the training and inference stages. Specifically, a Temporal Deformable Alignment (TDA) module based on the designed Dilated Convolution Attention Fusion (DCAF) block is developed to explicitly align the adjacent frames with the current frame at the feature level. Then, the feature refinement module adaptively fuses the aligned features and further aggregates useful spatio-temporal information by means of the proposed Attention-guided Deformable Fusion (AGDF) block. In addition, to improve the alignment of adjacent frames with the current frame, we extend the traditional loss function by introducing a new motion compensation loss. Extensive experimental results demonstrate that the proposed DFAR method achieves the state-of-the-art performance on two benchmark datasets including DAUB and IRDST.

IRMay 12, 2024Code
Disentangling Specificity for Abstractive Multi-document Summarization

Congbo Ma, Wei Emma Zhang, Hu Wang et al.

Multi-document summarization (MDS) generates a summary from a document set. Each document in a set describes topic-relevant concepts, while per document also has its unique contents. However, the document specificity receives little attention from existing MDS approaches. Neglecting specific information for each document limits the comprehensiveness of the generated summaries. To solve this problem, in this paper, we propose to disentangle the specific content from documents in one document set. The document-specific representations, which are encouraged to be distant from each other via a proposed orthogonal constraint, are learned by the specific representation learner. We provide extensive analysis and have interesting findings that specific information and document set representations contribute distinctive strengths and their combination yields a more comprehensive solution for the MDS. Also, we find that the common (i.e. shared) information could not contribute much to the overall performance under the MDS settings. Implemetation codes are available at https://github.com/congboma/DisentangleSum.

IVMar 20, 2025Code
SALT: Parameter-Efficient Fine-Tuning via Singular Value Adaptation with Low-Rank Transformation

Abdelrahman Elsayed, Sarim Hashmi, Mohammed Elseiagy et al.

The complex nature of medical image segmentation calls for models that are specifically designed to capture detailed, domain-specific features. Large foundation models offer considerable flexibility, yet the cost of fine-tuning these models remains a significant barrier. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), efficiently update model weights with low-rank matrices but may suffer from underfitting when the chosen rank is insufficient to capture domain-specific nuances. Conversely, full-rank Singular Value Decomposition (SVD) based methods provide comprehensive updates by modifying all singular values, yet they often lack flexibility and exhibit variable performance across datasets. We propose SALT (Singular Value Adaptation with Low-Rank Transformation), a method that selectively adapts the most influential singular values using trainable scale and shift parameters while complementing this with a low-rank update for the remaining subspace. This hybrid approach harnesses the advantages of both LoRA and SVD, enabling effective adaptation without relying on increasing model size or depth. Evaluated on 5 challenging medical datasets, ranging from as few as 20 samples to 1000, SALT outperforms state-of-the-art PEFT (LoRA and SVD) by 2% to 5% in Dice with only 3.9% trainable parameters, demonstrating robust adaptation even in low-resource settings. The code for SALT is available at: https://github.com/BioMedIA-MBZUAI/SALT

CVMay 12, 2024Code
Meta-Learned Modality-Weighted Knowledge Distillation for Robust Multi-Modal Learning with Missing Data

Hu Wang, Salma Hassan, Yuyuan Liu et al.

In multi-modal learning, some modalities are more influential than others, and their absence can have a significant impact on classification/segmentation accuracy. Addressing this challenge, we propose a novel approach called Meta-learned Modality-weighted Knowledge Distillation (MetaKD), which enables multi-modal models to maintain high accuracy even when key modalities are missing. MetaKD adaptively estimates the importance weight of each modality through a meta-learning process. These learned importance weights guide a pairwise modality-weighted knowledge distillation process, allowing high-importance modalities to transfer knowledge to lower-importance ones, resulting in robust performance despite missing inputs. Unlike previous methods in the field, which are often task-specific and require significant modifications, our approach is designed to work in multiple tasks (e.g., segmentation and classification) with minimal adaptation. Experimental results on five prevalent datasets, including three Brain Tumor Segmentation datasets (BraTS2018, BraTS2019 and BraTS2020), the Alzheimer's Disease Neuroimaging Initiative (ADNI) classification dataset and the Audiovision-MNIST classification dataset, demonstrate the proposed model is able to outperform the compared models by a large margin. The code is available at https://github.com/billhhh/MetaKD.

CLSep 4, 2025Code
Explicit and Implicit Data Augmentation for Social Event Detection

Congbo Ma, Yuxia Wang, Jia Wu et al.

Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available at GitHub: https://github.com/congboma/SED-Aug.

CVJul 30, 2025Code
Advancing Fetal Ultrasound Image Quality Assessment in Low-Resource Settings

Dongli He, Hu Wang, Mohammad Yaqub

Accurate fetal biometric measurements, such as abdominal circumference, play a vital role in prenatal care. However, obtaining high-quality ultrasound images for these measurements heavily depends on the expertise of sonographers, posing a significant challenge in low-income countries due to the scarcity of trained personnel. To address this issue, we leverage FetalCLIP, a vision-language model pretrained on a curated dataset of over 210,000 fetal ultrasound image-caption pairs, to perform automated fetal ultrasound image quality assessment (IQA) on blind-sweep ultrasound data. We introduce FetalCLIP$_{CLS}$, an IQA model adapted from FetalCLIP using Low-Rank Adaptation (LoRA), and evaluate it on the ACOUSLIC-AI dataset against six CNN and Transformer baselines. FetalCLIP$_{CLS}$ achieves the highest F1 score of 0.757. Moreover, we show that an adapted segmentation model, when repurposed for classification, further improves performance, achieving an F1 score of 0.771. Our work demonstrates how parameter-efficient fine-tuning of fetal ultrasound foundation models can enable task-specific adaptations, advancing prenatal care in resource-limited settings. The experimental code is available at: https://github.com/donglihe-hub/FetalCLIP-IQA.

CVJul 28, 2025Code
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

Ao Li, Yuxiang Duan, Jinghui Zhang et al.

Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV)-which measures changes in both the magnitude and direction of token representations-and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.

CVMay 19, 2025Code
IA-MVS: Instance-Focused Adaptive Depth Sampling for Multi-View Stereo

Yinzhe Wang, Yiwen Xiao, Hu Wang et al.

Multi-view stereo (MVS) models based on progressive depth hypothesis narrowing have made remarkable advancements. However, existing methods haven't fully utilized the potential that the depth coverage of individual instances is smaller than that of the entire scene, which restricts further improvements in depth estimation precision. Moreover, inevitable deviations in the initial stage accumulate as the process advances. In this paper, we propose Instance-Adaptive MVS (IA-MVS). It enhances the precision of depth estimation by narrowing the depth hypothesis range and conducting refinement on each instance. Additionally, a filtering mechanism based on intra-instance depth continuity priors is incorporated to boost robustness. Furthermore, recognizing that existing confidence estimation can degrade IA-MVS performance on point clouds. We have developed a detailed mathematical model for confidence estimation based on conditional probability. The proposed method can be widely applied in models based on MVSNet without imposing extra training burdens. Our method achieves state-of-the-art performance on the DTU benchmark. The source code is available at https://github.com/KevinWang73106/IA-MVS.

CVMar 11, 2025Code
Modeling Variants of Prompts for Vision-Language Models

Ao Li, Zongfang Liu, Xinhua Li et al.

Large pre-trained vision-language models (VLMs) offer a promising approach to leveraging human language for enhancing downstream tasks. However, VLMs such as CLIP face significant limitation: its performance is highly sensitive to prompt template design. Although prompt learning methods can address the sensitivity issue by replacing natural language prompts with learnable ones, they are incomprehensible to humans. Ensuring consistent performance across various prompt templates enables models to adapt seamlessly to diverse phrasings, enhancing their ability to handle downstream tasks without requiring extensive prompt engineering. In this work, we introduce the RobustPrompt Benchmark, a systematic benchmark to evaluate robustness to different prompt templates for VLMs. It includes a dataset with hundreds of carefully designed prompt templates, divided into six types, covering a wide variety of commonly used templates. Beside the benchmark, we propose Modeling Variants of Prompts (MVP), a simple yet effective method that mitigates sensitivity by modeling variants of prompt structures. The innovation of MVP lies in decoupling prompts into templates and class names, and using Variational Autoencoders (VAE) to model the distribution of diverse prompt structures. Experiments across 11 datasets demonstrate that MVP can greatly enhance model robustness to variations in input prompts without a drop in performance. The code is available at https://github.com/liaolea/MVP.

CVNov 21, 2020Code
Robust Data Hiding Using Inverse Gradient Attention

Honglei Zhang, Hu Wang, Yuanzhouhan Cao et al.

Data hiding is the procedure of encoding desired information into a certain types of cover media (e.g. images) to resist potential noises for data recovery, while ensuring the embedded image has few perceptual perturbations. Recently, with the tremendous successes gained by deep neural networks in various fields, the research on data hiding with deep learning models has attracted an increasing amount of attentions. In deep data hiding models, to maximize the encoding capacity, each pixel of the cover image ought to be treated differently since they have different sensitivities w.r.t. visual quality. The neglecting to consider the sensitivity of each pixel inevitably affects the model's robustness for information hiding. In this paper, we propose a novel deep data hiding scheme with Inverse Gradient Attention (IGA), combining the idea of attention mechanism to endow different attention weights for different pixels. Equipped with the proposed modules, the model can spotlight pixels with more robustness for data hiding. Extensive experiments demonstrate that the proposed model outperforms the mainstream deep learning based data hiding methods on two prevalent datasets under multiple evaluation metrics. Besides, we further identify and discuss the connections between the proposed inverse gradient attention and high-frequency regions within images, which can serve as an informative reference to the deep data hiding research community. The codes are available at: https://github.com/hongleizhang/IGA.

CVDec 22, 2019Code
Unsupervised Representation Learning by Predicting Random Distances

Hu Wang, Guansong Pang, Chunhua Shen et al.

Deep neural networks have gained tremendous success in a broad range of machine learning tasks due to its remarkable capability to learn semantic-rich features from high-dimensional data. However, they often require large-scale labelled data to successfully learn such features, which significantly hinders their adaption into unsupervised learning tasks, such as anomaly detection and clustering, and limits their applications into critical domains where obtaining massive labelled data is prohibitively expensive. To enable unsupervised learning on those domains, in this work we propose to learn features without using any labelled data by training neural networks to predict data distances in a randomly projected space. Random mapping is a theoretically proven approach to obtain approximately preserved distances. To well predict these random distances, the representation learner is optimised to learn genuine class structures that are implicitly embedded in the randomly projected space. Empirical results on 19 real-world datasets show that our learned representations substantially outperform a few state-of-the-art competing methods in both anomaly detection and clustering tasks. Code is available at https://git.io/RDP

11.6CVMay 9
Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours

Jin Liu, Wang Wang, Hongxu Pu et al.

AI-assisted bridge defect inspection often produces bounding boxes with crude geometry or raster masks that are costly to store, transmit, and reuse. This study investigates how detected defects can be represented as compact, recoverable contour-level vector records in image space. We propose Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. On 3,767 UAV-collected bridge images with 42,346 defect instances, FS-FSD achieves higher polygon-space accuracy and better matched-TP geometric quality than representative detection, segmentation, and contour baselines. These results show that, compared with bounding boxes and raster masks, Fourier contour records preserve defect-boundary geometry in a more compact, recoverable, and shareable form for engineering review and downstream information workflows. Future work will study the modeling of multi-region, fragmented, and adjacent bridge-defect boundaries and extend the framework toward long-term bridge-defect tracking and lifecycle-oriented management.

49.0CVMay 5
FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection

Kaixiang Zhao, Mao Ye, Lihua Zhou et al.

Open-vocabulary object detection often fails under distribution shifts, as it can be misled by spurious correlations between non-causal visual attributes (e.g., brightness, texture) and object categories. Existing test-time adaptation (TTA) methods either depend on costly online optimization or perform global calibration, overlooking the attribute-specific nature of these failures. To address this, we propose FACTOR (counterFACtual training-free Test-time adaptation for Open-vocabulaRy object detection), a lightweight framework grounded in counterfactual reasoning. By perturbing test images along non-causal attributes and comparing region-level predictions between original and counterfactual views, FACTOR quantifies attribute sensitivity, semantic relevance, and prediction variation to selectively suppress attribute-dependent predictions-without parameter updates. Experiments on PASCAL-C, COCO-C, and FoggyCityscapes show that FACTOR consistently outperforms prior TTA methods, demonstrating that explicit counterfactual reasoning effectively improves robustness under distribution shifts.

CVDec 14, 2023
Segment Beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation

Renjie Wu, Hu Wang, Feras Dayoub et al.

Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have limited field-of-view (FoV) with front or downward perspectives. Addressing this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which miss the information beyond FoV, with the auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes views with limited FoV and binaural audio as input and produce semantic segmentation for objects outside FoV. SBV outperforms existing models in comparative evaluations and shows a consistent performance across varying FoV ranges and in monaural audio settings.

LGDec 12, 2023
Dynamically configured physics-informed neural network in topology optimization applications

Jichao Yin, Ziming Wen, Shuhao Li et al.

Integration of machine learning (ML) into the topology optimization (TO) framework is attracting increasing attention, but data acquisition in data-driven models is prohibitive. Compared with popular ML methods, the physics-informed neural network (PINN) can avoid generating enormous amounts of data when solving forward problems and additionally provide better inference. To this end, a dynamically configured PINN-based topology optimization (DCPINN-TO) method is proposed. The DCPINN is composed of two subnetworks, namely the backbone neural network (NN) and the coefficient NN, where the coefficient NN has fewer trainable parameters. The designed architecture aims to dynamically configure trainable parameters; that is, an inexpensive NN is used to replace an expensive one at certain optimization cycles. Furthermore, an active sampling strategy is proposed to selectively sample collocations depending on the pseudo-densities at each optimization cycle. In this manner, the number of collocations will decrease with the optimization process but will hardly affect it. The Gaussian integral is used to calculate the strain energy of elements, which yields a byproduct of decoupling the mapping of the material at the collocations. Several examples with different resolutions validate the feasibility of the DCPINN-TO method, and multiload and multiconstraint problems are employed to illustrate its generalization. In addition, compared to finite element analysis-based TO (FEA-TO), the accuracy of the displacement prediction and optimization results indicate that the DCPINN-TO method is effective and efficient.

CVFeb 27, 2025
In-Model Merging for Enhancing the Robustness of Medical Imaging Classification Models

Hu Wang, Ibrahim Almakky, Congbo Ma et al.

Model merging is an effective strategy to merge multiple models for enhancing model performances, and more efficient than ensemble learning as it will not introduce extra computation into inference. However, limited research explores if the merging process can occur within one model and enhance the model's robustness, which is particularly critical in the medical image domain. In the paper, we are the first to propose in-model merging (InMerge), a novel approach that enhances the model's robustness by selectively merging similar convolutional kernels in the deep layers of a single convolutional neural network (CNN) during the training process for classification. We also analytically reveal important characteristics that affect how in-model merging should be performed, serving as an insightful reference for the community. We demonstrate the feasibility and effectiveness of this technique for different CNN architectures on 4 prevalent datasets. The proposed InMerge-trained model surpasses the typically-trained model by a substantial margin. The code will be made public.

IVJul 31, 2025
EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision

Ahmed Jaheen, Abdelrahman Elsayed, Damir Kim et al.

Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt -- an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.

CVMay 2, 2025
High Dynamic Range Novel View Synthesis with Single Exposure

Kaixuan Zhang, Hu Wang, Minxian Li et al.

High Dynamic Range Novel View Synthesis (HDR-NVS) aims to establish a 3D scene HDR model from Low Dynamic Range (LDR) imagery. Typically, multiple-exposure LDR images are employed to capture a wider range of brightness levels in a scene, as a single LDR image cannot represent both the brightest and darkest regions simultaneously. While effective, this multiple-exposure HDR-NVS approach has significant limitations, including susceptibility to motion artifacts (e.g., ghosting and blurring), high capture and storage costs. To overcome these challenges, we introduce, for the first time, the single-exposure HDR-NVS problem, where only single exposure LDR images are available during training. We further introduce a novel approach, Mono-HDR-3D, featuring two dedicated modules formulated by the LDR image formation principles, one for converting LDR colors to HDR counterparts, and the other for transforming HDR images to LDR format so that unsupervised learning is enabled in a closed loop. Designed as a meta-algorithm, our approach can be seamlessly integrated with existing NVS models. Extensive experiments show that Mono-HDR-3D significantly outperforms previous methods. Source code will be released.

CVOct 3, 2025
DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis

Numan Saeed, Tausifa Jan Saleem, Fadillah Maani et al.

Deep learning for medical imaging is hampered by task-specific models that lack generalizability and prognostic capabilities, while existing 'universal' approaches suffer from simplistic conditioning and poor medical semantic understanding. To address these limitations, we introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis. DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task, a capability absent in prior universal models. To enable extensibility to other medical tasks, it includes a hierarchical, text-controlled architecture driven by a unique dual-prompt mechanism. For segmentation, DuPLUS is able to generalize across three imaging modalities, ten different anatomically various medical datasets, encompassing more than 30 organs and tumor types. It outperforms the state-of-the-art task specific and universal models on 8 out of 10 datasets. We demonstrate extensibility of its text-controlled architecture by seamless integration of electronic health record (EHR) data for prognosis prediction, and on a head and neck cancer dataset, DuPLUS achieved a Concordance Index (CI) of 0.69. Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities from varying centers, establishing DuPLUS as a versatile and clinically relevant solution for medical image analysis. The code for this work is made available at: https://anonymous.4open.science/r/DuPLUS-6C52

CVSep 8, 2025
Predicting Brain Tumor Response to Therapy using a Hybrid Deep Learning and Radiomics Approach

Daniil Tikhonov, Matheus Scatolin, Mohor Banerjee et al.

Accurate evaluation of the response of glioblastoma to therapy is crucial for clinical decision-making and patient management. The Response Assessment in Neuro-Oncology (RANO) criteria provide a standardized framework to assess patients' clinical response, but their application can be complex and subject to observer variability. This paper presents an automated method for classifying the intervention response from longitudinal MRI scans, developed to predict tumor response during therapy as part of the BraTS 2025 challenge. We propose a novel hybrid framework that combines deep learning derived feature extraction and an extensive set of radiomics and clinically chosen features. Our approach utilizes a fine-tuned ResNet-18 model to extract features from 2D regions of interest across four MRI modalities. These deep features are then fused with a rich set of more than 4800 radiomic and clinically driven features, including 3D radiomics of tumor growth and shrinkage masks, volumetric changes relative to the nadir, and tumor centroid shift. Using the fused feature set, a CatBoost classifier achieves a mean ROC AUC of 0.81 and a Macro F1 score of 0.50 in the 4-class response prediction task (Complete Response, Partial Response, Stable Disease, Progressive Disease). Our results highlight that synergizing learned image representations with domain-targeted radiomic features provides a robust and effective solution for automated treatment response assessment in neuro-oncology.

CVAug 21, 2025
Bidirectional Temporal Information Propagation for Moving Infrared Small Target Detection

Dengyan Luo, Yanping Xiang, Hu Wang et al.

Moving infrared small target detection is broadly adopted in infrared search and track systems, and has attracted considerable research focus in recent years. The existing learning-based multi-frame methods mainly aggregate the information of adjacent frames in a sliding window fashion to assist the detection of the current frame. However, the sliding-window-based methods do not consider joint optimization of the entire video clip and ignore the global temporal information outside the sliding window, resulting in redundant computation and sub-optimal performance. In this paper, we propose a Bidirectional temporal information propagation method for moving InfraRed small target Detection, dubbed BIRD. The bidirectional propagation strategy simultaneously utilizes local temporal information of adjacent frames and global temporal information of past and future frames in a recursive fashion. Specifically, in the forward and backward propagation branches, we first design a Local Temporal Motion Fusion (LTMF) module to model local spatio-temporal dependency between a target frame and its two adjacent frames. Then, a Global Temporal Motion Fusion (GTMF) module is developed to further aggregate the global propagation feature with the local fusion feature. Finally, the bidirectional aggregated features are fused and input into the detection head for detection. In addition, the entire video clip is jointly optimized by the traditional detection loss and the additional Spatio-Temporal Fusion (STF) loss. Extensive experiments demonstrate that the proposed BIRD method not only achieves the state-of-the-art performance but also shows a fast inference speed.

CVJul 7, 2025
GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation

Weilin Lai, Tie Xu, Hu Wang

Direct B-Rep generation is increasingly important in CAD workflows, eliminating costly modeling sequence data and supporting complex features. A key challenge is modeling joint distribution of the misaligned geometry and topology. Existing methods tend to implicitly embed topology into the geometric features of edges. Although this integration ensures feature alignment, it also causes edge geometry to carry more redundant structural information compared to the original B-Rep, leading to significantly higher computational cost. To reduce redundancy, we propose GraphBrep, a B-Rep generation model that explicitly represents and learns compact topology. Following the original structure of B-Rep, we construct an undirected weighted graph to represent surface topology. A graph diffusion model is employed to learn topology conditioned on surface features, serving as the basis for determining connectivity between primitive surfaces. The explicit representation ensures a compact data structure, effectively reducing computational cost during both training and inference. Experiments on two large-scale unconditional datasets and one category-conditional dataset demonstrate the proposed method significantly reduces training and inference times (up to 31.3% and 56.3% for given datasets, respectively) while maintaining high-quality CAD generation compared with SOTA.

CVJun 15, 2025
Intriguing Frequency Interpretation of Adversarial Robustness for CNNs and ViTs

Lu Chen, Han Yang, Hu Wang et al.

Adversarial examples have attracted significant attention over the years, yet understanding their frequency-based characteristics remains insufficient. In this paper, we investigate the intriguing properties of adversarial examples in the frequency domain for the image classification task, with the following key findings. (1) As the high-frequency components increase, the performance gap between adversarial and natural examples becomes increasingly pronounced. (2) The model performance against filtered adversarial examples initially increases to a peak and declines to its inherent robustness. (3) In Convolutional Neural Networks, mid- and high-frequency components of adversarial examples exhibit their attack capabilities, while in Transformers, low- and mid-frequency components of adversarial examples are particularly effective. These results suggest that different network architectures have different frequency preferences and that differences in frequency components between adversarial and natural examples may directly influence model robustness. Based on our findings, we further conclude with three useful proposals that serve as a valuable reference to the AI model security community.

CLSep 23, 2021
Dependency Structure for News Document Summarization

Congbo Ma, Wei Emma Zhang, Hu Wang et al.

In this work, we develop a neural network based model which leverages dependency parsing to capture cross-positional dependencies and grammatical structures. With the help of linguistic signals, sentence-level relations can be correctly captured, thus improving news documents summarization performance. Empirical studies demonstrate that this simple but effective method outperforms existing works on the benchmark dataset. Extensive analyses examine different settings and configurations of the proposed model which provide a good reference to the community.

CVJul 20, 2021
Data Hiding with Deep Learning: A Survey Unifying Digital Watermarking and Steganography

Zihan Wang, Olivia Byrnes, Hu Wang et al.

The advancement of secure communication and identity verification fields has significantly increased through the use of deep learning techniques for data hiding. By embedding information into a noise-tolerant signal such as audio, video, or images, digital watermarking and steganography techniques can be used to protect sensitive intellectual property and enable confidential communication, ensuring that the information embedded is only accessible to authorized parties. This survey provides an overview of recent developments in deep learning techniques deployed for data hiding, categorized systematically according to model architectures and noise injection methods. The objective functions, evaluation metrics, and datasets used for training these data hiding models are comprehensively summarised. Additionally, potential future research directions that unite digital watermarking and steganography on software engineering to enhance security and mitigate risks are suggested and deliberated. This contribution furthers the creation of a more trustworthy digital world and advances Responsible AI.

LGMay 20, 2021
Fed-EINI: An Efficient and Interpretable Inference Framework for Decision Tree Ensembles in Federated Learning

Xiaolin Chen, Shuai Zhou, Bei guan et al.

The increasing concerns about data privacy and security drive an emerging field of studying privacy-preserving machine learning from isolated data sources, i.e., federated learning. A class of federated learning, vertical federated learning, where different parties hold different features for common users, has a great potential of driving a great variety of business cooperation among enterprises in many fields. In machine learning, decision tree ensembles such as gradient boosting decision trees (GBDT) and random forest are widely applied powerful models with high interpretability and modeling efficiency. However, stateof-art vertical federated learning frameworks adapt anonymous features to avoid possible data breaches, makes the interpretability of the model compromised. To address this issue in the inference process, in this paper, we firstly make a problem analysis about the necessity of disclosure meanings of feature to Guest Party in vertical federated learning. Then we find the prediction result of a tree could be expressed as the intersection of results of sub-models of the tree held by all parties. With this key observation, we protect data privacy and allow the disclosure of feature meaning by concealing decision paths and adapt a communication-efficient secure computation method for inference outputs. The advantages of Fed-EINI will be demonstrated through both theoretical analysis and extensive numerical results. We improve the interpretability of the model by disclosing the meaning of features while ensuring efficiency and accuracy.

CVApr 19, 2021
Kernel Adversarial Learning for Real-world Image Super-resolution

Hu Wang, Congbo Ma, Jianpeng Zhang et al.

Current deep image super-resolution (SR) approaches aim to restore high-resolution images from down-sampled images or by assuming degradation from simple Gaussian kernels and additive noises. However, these techniques only assume crude approximations of the real-world image degradation process, which should involve complex kernels and noise patterns that are difficult to model using simple assumptions. In this paper, we propose a more realistic process to synthesise low-resolution images for real-world image SR by introducing a new Kernel Adversarial Learning Super-resolution (KASR) framework. In the proposed framework, degradation kernels and noises are adaptively modelled rather than explicitly specified. Moreover, we also propose a high-frequency selective objective and an iterative supervision process to further boost the model SR reconstruction accuracy. Extensive experiments validate the effectiveness of the proposed framework on real-world datasets.

CRFeb 23, 2021
Oriole: Thwarting Privacy against Trustworthy Deep Learning Models

Liuqiao Chen, Hu Wang, Benjamin Zi Hao Zhao et al.

Deep Neural Networks have achieved unprecedented success in the field of face recognition such that any individual can crawl the data of others from the Internet without their explicit permission for the purpose of training high-precision face recognition models, creating a serious violation of privacy. Recently, a well-known system named Fawkes (published in USENIX Security 2020) claimed this privacy threat can be neutralized by uploading cloaked user images instead of their original images. In this paper, we present Oriole, a system that combines the advantages of data poisoning attacks and evasion attacks, to thwart the protection offered by Fawkes, by training the attacker face recognition model with multi-cloaked images generated by Oriole. Consequently, the face recognition accuracy of the attack model is maintained and the weaknesses of Fawkes are revealed. Experimental results show that our proposed Oriole system is able to effectively interfere with the performance of the Fawkes system to achieve promising attacking results. Our ablation study highlights multiple principal factors that affect the performance of the Oriole system, including the DSSIM perturbation budget, the ratio of leaked clean user images, and the numbers of multi-cloaks for each uncloaked image. We also identify and discuss at length the vulnerabilities of Fawkes. We hope that the new methodology presented in this paper will inform the security community of a need to design more robust privacy-preserving deep learning models.

LGFeb 21, 2021
Delayed Rewards Calibration via Reward Empirical Sufficiency

Yixuan Liu, Hu Wang, Xiaowei Wang et al.

Appropriate credit assignment for delay rewards is a fundamental challenge for reinforcement learning. To tackle this problem, we introduce a delay reward calibration paradigm inspired from a classification perspective. We hypothesize that well-represented state vectors share similarities with each other since they contain the same or equivalent essential information. To this end, we define an empirical sufficient distribution, where the state vectors within the distribution will lead agents to environmental reward signals in the consequent steps. Therefore, a purify-trained classifier is designed to obtain the distribution and generate the calibrated rewards. We examine the correctness of sufficient state extraction by tracking the real-time extraction and building different reward functions in environments. The results demonstrate that the classifier could generate timely and accurate calibrated rewards. Moreover, the rewards are able to make the model training process more efficient. Finally, we identify and discuss that the sufficient states extracted by our model resonate with the observations of humans.

SYJan 24, 2021
Multi-intersection Traffic Optimisation: A Benchmark Dataset and a Strong Baseline

Hu Wang, Hao Chen, Qi Wu et al.

The control of traffic signals is fundamental and critical to alleviate traffic congestion in urban areas. However, it is challenging since traffic dynamics are complicated in real-world scenarios. Because of the high complexity of the optimisation problem for modelling the traffic, experimental settings of existing works are often inconsistent. Moreover, it is not trivial to control multiple intersections properly in real complex traffic scenarios due to its vast state and action space. Failing to take intersection topology relations into account also results in inferior solutions. To address these issues, in this work we carefully design our settings and propose a new dataset including both synthetic and real traffic data in more complex scenarios. Additionally, we propose a novel baseline model with strong performance. It is based on deep reinforcement learning with an encoder-decoder structure: an edge-weighted graph convolutional encoder to excavate multi-intersection relations; and an unified structure decoder to jointly model multiple junctions in a comprehensive manner, which significantly reduces the number of the model parameters. By doing so, the proposed model is able to effectively deal with the multi-intersection traffic optimisation problem. Models are trained/tested on both synthetic and real maps and traffic data with the Simulation of Urban Mobility (SUMO) simulator. Experimental results show that the proposed model surpasses multiple competitive methods.