CVDec 16, 2022
Biomedical image analysis competitions: The state of current participation practiceMatthias Eisenmann, Annika Reinke, Vivienn Weru et al. · utoronto
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
CVJul 23, 2023
Downstream-agnostic Adversarial ExamplesZiqi Zhou, Shengshan Hu, Ruizhi Zhao et al.
Self-supervised learning usually uses a large amount of unlabeled data to pre-train an encoder which can be used as a general-purpose feature extractor, such that downstream users only need to perform fine-tuning operations to enjoy the benefit of "large model". Despite this promising prospect, the security of pre-trained encoder has not been thoroughly investigated yet, especially when the pre-trained encoder is publicly available for commercial use. In this paper, we propose AdvEncoder, the first framework for generating downstream-agnostic universal adversarial examples based on the pre-trained encoder. AdvEncoder aims to construct a universal adversarial perturbation or patch for a set of natural images that can fool all the downstream tasks inheriting the victim pre-trained encoder. Unlike traditional adversarial example works, the pre-trained encoder only outputs feature vectors rather than classification labels. Therefore, we first exploit the high frequency component information of the image to guide the generation of adversarial examples. Then we design a generative attack framework to construct adversarial perturbations/patches by learning the distribution of the attack surrogate dataset to improve their attack success rates and transferability. Our results show that an attacker can successfully attack downstream tasks without knowing either the pre-training dataset or the downstream dataset. We also tailor four defenses for pre-trained encoders, the results of which further prove the attack ability of AdvEncoder.
CVAug 8, 2022
Generalizable Medical Image Segmentation via Random Amplitude Mixup and Domain-Specific Image RestorationZiqi Zhou, Lei Qi, Yinghuan Shi
For medical image analysis, segmentation models trained on one or several domains lack generalization ability to unseen domains due to discrepancies between different data acquisition policies. We argue that the degeneration in segmentation performance is mainly attributed to overfitting to source domains and domain shift. To this end, we present a novel generalizable medical image segmentation method. To be specific, we design our approach as a multi-task paradigm by combining the segmentation model with a self-supervision domain-specific image restoration (DSIR) module for model regularization. We also design a random amplitude mixup (RAM) module, which incorporates low-level frequency information of different domain images to synthesize new images. To guide our model be resistant to domain shift, we introduce a semantic consistency loss. We demonstrate the performance of our method on two public generalizable segmentation benchmarks in medical images, which validates our method could achieve the state-of-the-art performance.
CVAug 14, 2023
AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive LearningZiqi Zhou, Shengshan Hu, Minghui Li et al.
Multimodal contrastive learning aims to train a general-purpose feature extractor, such as CLIP, on vast amounts of raw, unlabeled paired image-text data. This can greatly benefit various complex downstream tasks, including cross-modal image-text retrieval and image classification. Despite its promising prospect, the security issue of cross-modal pre-trained encoder has not been fully explored yet, especially when the pre-trained encoder is publicly available for commercial use. In this work, we propose AdvCLIP, the first attack framework for generating downstream-agnostic adversarial examples based on cross-modal pre-trained encoders. AdvCLIP aims to construct a universal adversarial patch for a set of natural images that can fool all the downstream tasks inheriting the victim cross-modal pre-trained encoder. To address the challenges of heterogeneity between different modalities and unknown downstream tasks, we first build a topological graph structure to capture the relevant positions between target samples and their neighbors. Then, we design a topology-deviation based generative adversarial network to generate a universal adversarial patch. By adding the patch to images, we minimize their embeddings similarity to different modality and perturb the sample distribution in the feature space, achieving unviersal non-targeted attacks. Our results demonstrate the excellent attack performance of AdvCLIP on two types of downstream tasks across eight datasets. We also tailor three popular defenses to mitigate AdvCLIP, highlighting the need for new defense mechanisms to defend cross-modal pre-trained encoders.
CVJul 1, 2022
BadHash: Invisible Backdoor Attacks against Deep Hashing with Clean LabelShengshan Hu, Ziqi Zhou, Yechao Zhang et al.
Due to its powerful feature learning capability and high efficiency, deep hashing has achieved great success in large-scale image retrieval. Meanwhile, extensive works have demonstrated that deep neural networks (DNNs) are susceptible to adversarial examples, and exploring adversarial attack against deep hashing has attracted many research efforts. Nevertheless, backdoor attack, another famous threat to DNNs, has not been studied for deep hashing yet. Although various backdoor attacks have been proposed in the field of image classification, existing approaches failed to realize a truly imperceptive backdoor attack that enjoys invisible triggers and clean label setting simultaneously, and they also cannot meet the intrinsic demand of image retrieval backdoor. In this paper, we propose BadHash, the first generative-based imperceptible backdoor attack against deep hashing, which can effectively generate invisible and input-specific poisoned images with clean label. Specifically, we first propose a new conditional generative adversarial network (cGAN) pipeline to effectively generate poisoned samples. For any given benign image, it seeks to generate a natural-looking poisoned counterpart with a unique invisible trigger. In order to improve the attack effectiveness, we introduce a label-based contrastive learning network LabCLN to exploit the semantic characteristics of different labels, which are subsequently used for confusing and misleading the target model to learn the embedded trigger. We finally explore the mechanism of backdoor attacks on image retrieval in the hash space. Extensive experiments on multiple benchmark datasets verify that BadHash can generate imperceptible poisoned samples with strong attack ability and transferability over state-of-the-art deep hashing schemes.
CYJul 16, 2024
BadRobot: Jailbreaking Embodied LLMs in the Physical WorldHangtao Zhang, Chenyu Zhu, Xianlong Wang et al.
Embodied AI represents systems where AI is integrated into physical entities. Large Language Model (LLM), which exhibits powerful language understanding abilities, has been extensively employed in embodied AI by facilitating sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by world knowledge's flaws. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of our BadRobot.
AISep 26, 2024
DarkSAM: Fooling Segment Anything Model to Segment NothingZiqi Zhou, Yufei Song, Minghui Li et al.
Segment Anything Model (SAM) has recently gained much attention for its outstanding generalization to unseen data and tasks. Despite its promising prospect, the vulnerabilities of SAM, especially to universal adversarial perturbation (UAP) have not been thoroughly investigated yet. In this paper, we propose DarkSAM, the first prompt-free universal attack framework against SAM, including a semantic decoupling-based spatial attack and a texture distortion-based frequency attack. We first divide the output of SAM into foreground and background. Then, we design a shadow target strategy to obtain the semantic blueprint of the image as the attack target. DarkSAM is dedicated to fooling SAM by extracting and destroying crucial object features from images in both spatial and frequency domains. In the spatial domain, we disrupt the semantics of both the foreground and background in the image to confuse SAM. In the frequency domain, we further enhance the attack effectiveness by distorting the high-frequency components (i.e., texture information) of the image. Consequently, with a single UAP, DarkSAM renders SAM incapable of segmenting objects across diverse images with varying prompts. Experimental results on four datasets for SAM and its two variant models demonstrate the powerful attack capability and transferability of DarkSAM.
LGJun 7, 2022
Neural Network Compression via Effective Filter Analysis and Hierarchical PruningZiqi Zhou, Li Lian, Yilong Yin et al.
Network compression is crucial to making the deep networks to be more efficient, faster, and generalizable to low-end hardware. Current network compression methods have two open problems: first, there lacks a theoretical framework to estimate the maximum compression rate; second, some layers may get over-prunned, resulting in significant network performance drop. To solve these two problems, this study propose a gradient-matrix singularity analysis-based method to estimate the maximum network redundancy. Guided by that maximum rate, a novel and efficient hierarchical network pruning algorithm is developed to maximally condense the neuronal network structure without sacrificing network performance. Substantial experiments are performed to demonstrate the efficacy of the new method for pruning several advanced convolutional neural network (CNN) architectures. Compared to existing pruning methods, the proposed pruning algorithm achieved state-of-the-art performance. At the same or similar compression ratio, the new method provided the highest network prediction accuracy as compared to other methods.
ROApr 2
Robot Collapse: Supply Chain Backdoor Attacks Against VLM-based Robotic ManipulationXianlong Wang, Hewen Pan, Hangtao Zhang et al.
Robotic manipulation policies are increasingly empowered by \textit{large language models} (LLMs) and \textit{vision-language models} (VLMs), leveraging their understanding and perception capabilities. Recently, inference-time attacks against robotic manipulation have been extensively studied, yet backdoor attacks targeting model supply chain security in robotic policies remain largely unexplored. To fill this gap, we propose \texttt{TrojanRobot}, a backdoor injection framework for model supply chain attack scenarios, which embeds a malicious module into modular robotic policies via backdoor relationships to manipulate the LLM-to-VLM pathway and compromise the system. Our vanilla design instantiates this module as a backdoor-finetuned VLM. To further enhance attack performance, we propose a prime scheme by introducing the concept of \textit{LVLM-as-a-backdoor}, which leverages \textit{in-context instruction learning} (ICIL) to steer \textit{large vision-language model} (LVLM) behavior through backdoored system prompts. Moreover, we develop three types of prime attacks, \textit{permutation}, \textit{stagnation}, and \textit{intentional}, achieving flexible backdoor attack effects. Extensive physical-world and simulator experiments on 18 real-world manipulation tasks and 4 VLMs verify the superiority of proposed \texttt{TrojanRobot}
CVJul 18, 2024
Transferable Adversarial Facial Images for Privacy ProtectionMinghui Li, Jiangxiong Wang, Hao Zhang et al.
The success of deep face recognition (FR) systems has raised serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. Previous studies proposed introducing imperceptible adversarial noises into face images to deceive those face recognition models, thus achieving the goal of enhancing facial privacy protection. Nevertheless, they heavily rely on user-chosen references to guide the generation of adversarial noises, and cannot simultaneously construct natural and highly transferable adversarial face images in black-box scenarios. In light of this, we present a novel face privacy protection scheme with improved transferability while maintain high visual quality. We propose shaping the entire face space directly instead of exploiting one kind of facial characteristic like makeup information to integrate adversarial noises. To achieve this goal, we first exploit global adversarial latent search to traverse the latent space of the generative model, thereby creating natural adversarial face images with high transferability. We then introduce a key landmark regularization module to preserve the visual identity information. Finally, we investigate the impacts of various kinds of latent spaces and find that $\mathcal{F}$ latent space benefits the trade-off between visual naturalness and adversarial transferability. Extensive experiments over two datasets demonstrate that our approach significantly enhances attack transferability while maintaining high visual quality, outperforming state-of-the-art methods by an average 25% improvement in deep FR models and 10% improvement on commercial FR APIs, including Face++, Aliyun, and Tencent.
CVJan 21
Erosion Attack for Adversarial Training to Enhance Semantic Segmentation RobustnessYufei Song, Ziqi Zhou, Menghao Deng et al.
Existing segmentation models exhibit significant vulnerability to adversarial attacks.To improve robustness, adversarial training incorporates adversarial examples into model training. However, existing attack methods consider only global semantic information and ignore contextual semantic relationships within the samples, limiting the effectiveness of adversarial training. To address this issue, we propose EroSeg-AT, a vulnerability-aware adversarial training framework that leverages EroSeg to generate adversarial examples. EroSeg first selects sensitive pixels based on pixel-level confidence and then progressively propagates perturbations to higher-confidence pixels, effectively disrupting the semantic consistency of the samples. Experimental results show that, compared to existing methods, our approach significantly improves attack effectiveness and enhances model robustness under adversarial training.
CRJan 28
UnlearnShield: Shielding Forgotten Privacy against Unlearning InversionLulu Xue, Shengshan Hu, Wei Lu et al.
Machine unlearning is an emerging technique that aims to remove the influence of specific data from trained models, thereby enhancing privacy protection. However, recent research has uncovered critical privacy vulnerabilities, showing that adversaries can exploit unlearning inversion to reconstruct data that was intended to be erased. Despite the severity of this threat, dedicated defenses remain lacking. To address this gap, we propose UnlearnShield, the first defense specifically tailored to counter unlearning inversion. UnlearnShield introduces directional perturbations in the cosine representation space and regulates them through a constraint module to jointly preserve model accuracy and forgetting efficacy, thereby reducing inversion risk while maintaining utility. Experiments demonstrate that it achieves a good trade-off among privacy protection, accuracy, and forgetting.
CVNov 30, 2023
Detecting and Corrupting Convolution-based Unlearnable ExamplesMinghui Li, Xianlong Wang, Zhifei Yu et al.
Convolution-based unlearnable examples (UEs) employ class-wise multiplicative convolutional noise to training samples, severely compromising model performance. This fire-new type of UEs have successfully countered all defense mechanisms against UEs. The failure of such defenses can be attributed to the absence of norm constraints on convolutional noise, leading to severe blurring of image features. To address this, we first design an Edge Pixel-based Detector (EPD) to identify convolution-based UEs. Upon detection of them, we propose the first defense scheme against convolution-based UEs, COrrupting these samples via random matrix multiplication by employing bilinear INterpolation (COIN) such that disrupting the distribution of class-wise multiplicative noise. To evaluate the generalization of our proposed COIN, we newly design two convolution-based UEs called VUDA and HUDA to expand the scope of convolution-based UEs. Extensive experiments demonstrate the effectiveness of detection scheme EPD and that our defense COIN outperforms 11 state-of-the-art (SOTA) defenses, achieving a significant improvement on the CIFAR and ImageNet datasets.
CVDec 12, 2025
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language ModelsHewen Pan, Cong Wei, Dashuang Liang et al.
With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform on holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.
CVApr 24
Transferable Physical-World Adversarial Patches Against Pedestrian Detection ModelsShihui Yan, Ziqi Zhou, Yufei Song et al.
Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their effectiveness in controlled settings, existing physical attacks face two major limitations in practice: they lack systematic disruption of the multi-stage decision pipeline, enabling residual modules to offset perturbations, and they fail to model complex physical variations, leading to poor robustness. To overcome these limitations, we propose a novel pedestrian adversarial patch generation method that combines multi-stage collaborative attacks with robustness enhancement under physical diversity, called TriPatch. Specifically, we design a triplet loss consisting of detection confidence suppression, bounding-box offset amplification, and non-maximum suppression (NMS) disruption, which jointly act across different stages of the detection pipeline. In addition, we introduce an appearance consistency loss to constrain the color distribution of the patch, thereby improving its adaptability under diverse imaging conditions, and incorporate data augmentation to further enhance robustness against complex physical perturbations. Extensive experiments demonstrate that TriPatch achieves a higher attack success rate across multiple detector models compared to existing approaches.
CVDec 21, 2023Code
A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive PropertiesJunfei Xiao, Ziqi Zhou, Wenxuan Li et al.
This paper introduces ProLab, a novel approach using property-level label space for creating strong interpretable segmentation models. Instead of relying solely on category-specific annotations, ProLab uses descriptive properties grounded in common sense knowledge for supervising segmentation models. It is based on two core designs. First, we employ Large Language Models (LLMs) and carefully crafted prompts to generate descriptions of all involved categories that carry meaningful common sense knowledge and follow a structured format. Second, we introduce a description embedding model preserving semantic correlation across descriptions and then cluster them into a set of descriptive properties (e.g., 256) using K-Means. These properties are based on interpretable common sense knowledge consistent with theories of human recognition. We empirically show that our approach makes segmentation models perform stronger on five classic benchmarks (e.g., ADE20K, COCO-Stuff, Pascal Context, Cityscapes, and BDD). Our method also shows better scalability with extended training steps than category-level supervision. Our interpretable segmentation framework also emerges with the generalization ability to segment out-of-domain or unknown categories using only in-domain descriptive properties. Code is available at https://github.com/lambert-x/ProLab.
CVOct 9, 2025Code
Towards Real-World Deepfake Detection: A Diverse In-the-wild Dataset of Forgery FacesJunyu Shi, Minghui Li, Junguo Zuo et al.
Deepfakes, leveraging advanced AIGC (Artificial Intelligence-Generated Content) techniques, create hyper-realistic synthetic images and videos of human faces, posing a significant threat to the authenticity of social media. While this real-world threat is increasingly prevalent, existing academic evaluations and benchmarks for detecting deepfake forgery often fall short to achieve effective application for their lack of specificity, limited deepfake diversity, restricted manipulation techniques.To address these limitations, we introduce RedFace (Real-world-oriented Deepfake Face), a specialized facial deepfake dataset, comprising over 60,000 forged images and 1,000 manipulated videos derived from authentic facial features, to bridge the gap between academic evaluations and real-world necessity. Unlike prior benchmarks, which typically rely on academic methods to generate deepfakes, RedFace utilizes 9 commercial online platforms to integrate the latest deepfake technologies found "in the wild", effectively simulating real-world black-box scenarios.Moreover, RedFace's deepfakes are synthesized using bespoke algorithms, allowing it to capture diverse and evolving methods used by real-world deepfake creators. Extensive experimental results on RedFace (including cross-domain, intra-domain, and real-world social network dissemination simulations) verify the limited practicality of existing deepfake detection schemes against real-world applications. We further perform a detailed analysis of the RedFace dataset, elucidating the reason of its impact on detection performance compared to conventional datasets. Our dataset is available at: https://github.com/kikyou-220/RedFace.
CRJun 21, 2024Code
ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion PurificationXianlong Wang, Shengshan Hu, Yechao Zhang et al.
Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images, thus dramatically reducing the generalization capability of the victim models. Recently, some defense mechanisms have been proposed such as adversarial training, image transformation techniques, and image purification. However, these schemes are either susceptible to adaptive attacks, built on unrealistic assumptions, or only effective against specific poison types, limiting their universal applicability. In this research, we propose a more universally effective, practical, and robust defense scheme called ECLIPSE. We first investigate the impact of Gaussian noise on the poisons and theoretically prove that any kind of poison will be largely assimilated when imposing sufficient random noise. In light of this, we assume the victim has access to an extremely limited number of clean images (a more practical scene) and subsequently enlarge this sparse set for training a denoising probabilistic model (a universal denoising tool). We then begin by introducing Gaussian noise to absorb the poisons and then apply the model for denoising, resulting in a roughly purified dataset. Finally, to address the trade-off of the inconsistency in the assimilation sensitivity of different poisons by Gaussian noise, we propose a lightweight corruption compensation module to effectively eliminate residual poisons, providing a more universal defense approach. Extensive experiments demonstrate that our defense approach outperforms 10 state-of-the-art defenses. We also propose an adaptive attack against ECLIPSE and verify the robustness of our defense scheme. Our code is available at https://github.com/CGCL-codes/ECLIPSE.
CVDec 21, 2021Code
Generalizable Cross-modality Medical Image Segmentation via Style Augmentation and Dual NormalizationZiqi Zhou, Lei Qi, Xin Yang et al.
For medical image segmentation, imagine if a model was only trained using MR images in source domain, how about its performance to directly segment CT images in target domain? This setting, namely generalizable cross-modality segmentation, owning its clinical potential, is much more challenging than other related settings, e.g., domain adaptation. To achieve this goal, we in this paper propose a novel dual-normalization model by leveraging the augmented source-similar and source-dissimilar images during our generalizable segmentation. To be specific, given a single source domain, aiming to simulate the possible appearance change in unseen target domains, we first utilize a nonlinear transformation to augment source-similar and source-dissimilar images. Then, to sufficiently exploit these two types of augmentations, our proposed dual-normalization based model employs a shared backbone yet independent batch normalization layer for separate normalization. Afterward, we put forward a style-based selection scheme to automatically choose the appropriate path in the test stage. Extensive experiments on three publicly available datasets, i.e., BraTS, Cross-Modality Cardiac, and Abdominal Multi-Organ datasets, have demonstrated that our method outperforms other state-of-the-art domain generalization methods. Code is available at https://github.com/zzzqzhou/Dual-Normalization.
AIMay 9
SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model InferenceZiqi Zhou, Peng Yang, Yuxin Liang et al.
The expansion of Artificial Intelligence-generated content service requires diffusion model serving to simultaneously achieve high throughput and low task end-to-end (E2E) latency. However, existing continuous batching methods suffer from severe resource contention during UNet-VAE concurrency, leading to latency spikes. Furthermore, concurrent multi-task scheduling entails a trade-off between UNet throughput and VAE latency across varying scheduling strategies. To address these, we propose SynerDiff, an efficient continuous batching system built on intra-inter level synergy. At the intra-concurrency level, SynerDiff alleviates resource contention by pruning component-specific resource bottlenecks via VAE Chunking and Adaptive Skip-CFG. At the inter-concurrency level, leveraging components' differential sensitivity to scheduling granularities, a threshold-aware scheduler plans concurrent sequences and tunes intra-concurrency decisions to minimize VAE latency while maintaining UNet within high-throughput threshold. Additionally, a feedback controller dynamically adjusts this threshold based on queue loads to boost system capacity ceiling. Experimental results show that, SynerDiff improves throughput by 1.6$\times$ and decreases both average E2E and P99 tail latencies by up to 78.7\%, compared to benchmarks while guaranteeing high image fidelity.
CVMay 3
Dual-branch Robust Unlearnable ExamplesXianlong Wang, Hangtao Zhang, Wenbo Pan et al.
Unlearnable examples (UEs) aim to compromise model training by injecting imperceptible perturbations to clean samples. However, existing UE schemes exhibit limited robustness against advanced defenses due to their heuristic design or narrowly scoped domain perturbations. To address this, we propose \texttt{DUNE}, a \underline{\textbf{D}}ual-branch \underline{\textbf{UN}}learnable \underline{\textbf{E}}nsemble perturbation optimization approach. Specifically, \texttt{DUNE} separately optimizes perturbations in the spatial and color domains to establish the mapping between perturbations and shift-induced labels. This design extends the perturbation domain to increase noise intensity for improving robustness and drives the models to learn perturbation-oriented features with degraded generalization, thereby achieving unlearnability. To strengthen \texttt{DUNE}'s performance, we further propose an unlearnability-enhancing ensemble strategy that aggregates diverse pre-trained models during the dual-branch optimization. Extensive experiments on benchmark datasets CIFAR-10 and ImageNet verify that \texttt{DUNE}'s robustness outperforms 12 SOTA UE schemes under 7 mainstream defenses, yielding a lower average test accuracy of 14.95\% to 50.82\%.
CVMar 16, 2024
Securely Fine-tuning Pre-trained Encoders Against Adversarial ExamplesZiqi Zhou, Minghui Li, Wei Liu et al.
With the evolution of self-supervised learning, the pre-training paradigm has emerged as a predominant solution within the deep learning landscape. Model providers furnish pre-trained encoders designed to function as versatile feature extractors, enabling downstream users to harness the benefits of expansive models with minimal effort through fine-tuning. Nevertheless, recent works have exposed a vulnerability in pre-trained encoders, highlighting their susceptibility to downstream-agnostic adversarial examples (DAEs) meticulously crafted by attackers. The lingering question pertains to the feasibility of fortifying the robustness of downstream models against DAEs, particularly in scenarios where the pre-trained encoders are publicly accessible to the attackers. In this paper, we initially delve into existing defensive mechanisms against adversarial examples within the pre-training paradigm. Our findings reveal that the failure of current defenses stems from the domain shift between pre-training data and downstream tasks, as well as the sensitivity of encoder parameters. In response to these challenges, we propose Genetic Evolution-Nurtured Adversarial Fine-tuning (Gen-AF), a two-stage adversarial fine-tuning approach aimed at enhancing the robustness of downstream models. Our extensive experiments, conducted across ten self-supervised training methods and six datasets, demonstrate that Gen-AF attains high testing accuracy and robust testing accuracy against state-of-the-art DAEs.
RONov 18, 2024
TrojanRobot: Physical-world Backdoor Attacks Against VLM-based Robotic ManipulationXianlong Wang, Hewen Pan, Hangtao Zhang et al.
Robotic manipulation in the physical world is increasingly empowered by \textit{large language models} (LLMs) and \textit{vision-language models} (VLMs), leveraging their understanding and perception capabilities. Recently, various attacks against such robotic policies have been proposed, with backdoor attacks drawing considerable attention for their high stealth and strong persistence capabilities. However, existing backdoor efforts are limited to simulators and suffer from physical-world realization. To address this, we propose \textit{TrojanRobot}, a highly stealthy and broadly effective robotic backdoor attack in the physical world. Specifically, we introduce a module-poisoning approach by embedding a backdoor module into the modular robotic policy, enabling backdoor control over the policy's visual perception module thereby backdooring the entire robotic policy. Our vanilla implementation leverages a backdoor-finetuned VLM to serve as the backdoor module. To enhance its generalization in physical environments, we propose a prime implementation, leveraging the LVLM-as-a-backdoor paradigm and developing three types of prime attacks, \ie, \textit{permutation}, \textit{stagnation}, and \textit{intentional} attacks, thus achieving finer-grained backdoors. Extensive experiments on the UR3e manipulator with 18 task instructions using robotic policies based on four VLMs demonstrate the broad effectiveness and physical-world stealth of TrojanRobot. Our attack's video demonstrations are available via a github link https://trojanrobot.github.io.
AIMay 7, 2024
Human-centric Reward Optimization for Reinforcement Learning-based Automated Driving using Large Language ModelsZiqi Zhou, Jingyue Zhang, Jingyuan Zhang et al.
One of the key challenges in current Reinforcement Learning (RL)-based Automated Driving (AD) agents is achieving flexible, precise, and human-like behavior cost-effectively. This paper introduces an innovative approach that uses large language models (LLMs) to intuitively and effectively optimize RL reward functions in a human-centric way. We developed a framework where instructions and dynamic environment descriptions are input into the LLM. The LLM then utilizes this information to assist in generating rewards, thereby steering the behavior of RL agents towards patterns that more closely resemble human driving. The experimental results demonstrate that this approach not only makes RL agents more anthropomorphic but also achieves better performance. Additionally, various strategies for reward-proxy and reward-shaping are investigated, revealing the significant impact of prompt design on shaping an AD vehicle's behavior. These findings offer a promising direction for the development of more advanced, human-like automated driving systems. Our experimental data and source code can be found here
CVDec 22, 2024
Breaking Barriers in Physical-World Adversarial Examples: Improving Robustness and Transferability via Robust FeatureYichen Wang, Yuxuan Chou, Ziqi Zhou et al.
As deep neural networks (DNNs) are widely applied in the physical world, many researches are focusing on physical-world adversarial examples (PAEs), which introduce perturbations to inputs and cause the model's incorrect outputs. However, existing PAEs face two challenges: unsatisfactory attack performance (i.e., poor transferability and insufficient robustness to environment conditions), and difficulty in balancing attack effectiveness with stealthiness, where better attack effectiveness often makes PAEs more perceptible. In this paper, we explore a novel perturbation-based method to overcome the challenges. For the first challenge, we introduce a strategy Deceptive RF injection based on robust features (RFs) that are predictive, robust to perturbations, and consistent across different models. Specifically, it improves the transferability and robustness of PAEs by covering RFs of other classes onto the predictive features in clean images. For the second challenge, we introduce another strategy Adversarial Semantic Pattern Minimization, which removes most perturbations and retains only essential adversarial patterns in AEsBased on the two strategies, we design our method Robust Feature Coverage Attack (RFCoA), comprising Robust Feature Disentanglement and Adversarial Feature Fusion. In the first stage, we extract target class RFs in feature space. In the second stage, we use attention-based feature fusion to overlay these RFs onto predictive features of clean images and remove unnecessary perturbations. Experiments show our method's superior transferability, robustness, and stealthiness compared to existing state-of-the-art methods. Additionally, our method's effectiveness can extend to Large Vision-Language Models (LVLMs), indicating its potential applicability to more complex tasks.
CVDec 22, 2024
NumbOD: A Spatial-Frequency Fusion Attack Against Object DetectorsZiqi Zhou, Bowen Li, Yufei Song et al.
With the advancement of deep learning, object detectors (ODs) with various architectures have achieved significant success in complex scenarios like autonomous driving. Previous adversarial attacks against ODs have been focused on designing customized attacks targeting their specific structures (e.g., NMS and RPN), yielding some results but simultaneously constraining their scalability. Moreover, most efforts against ODs stem from image-level attacks originally designed for classification tasks, resulting in redundant computations and disturbances in object-irrelevant areas (e.g., background). Consequently, how to design a model-agnostic efficient attack to comprehensively evaluate the vulnerabilities of ODs remains challenging and unresolved. In this paper, we propose NumbOD, a brand-new spatial-frequency fusion attack against various ODs, aimed at disrupting object detection within images. We directly leverage the features output by the OD without relying on its internal structures to craft adversarial examples. Specifically, we first design a dual-track attack target selection strategy to select high-quality bounding boxes from OD outputs for targeting. Subsequently, we employ directional perturbations to shift and compress predicted boxes and change classification results to deceive ODs. Additionally, we focus on manipulating the high-frequency components of images to confuse ODs' attention on critical objects, thereby enhancing the attack efficiency. Our extensive experiments on nine ODs and two datasets show that NumbOD achieves powerful attack performance and high stealthiness.
CVDec 21, 2024
PB-UAP: Hybrid Universal Adversarial Attack For Image SegmentationYufei Song, Ziqi Zhou, Minghui Li et al.
With the rapid advancement of deep learning, the model robustness has become a significant research hotspot, \ie, adversarial attacks on deep neural networks. Existing works primarily focus on image classification tasks, aiming to alter the model's predicted labels. Due to the output complexity and deeper network architectures, research on adversarial examples for segmentation models is still limited, particularly for universal adversarial perturbations. In this paper, we propose a novel universal adversarial attack method designed for segmentation models, which includes dual feature separation and low-frequency scattering modules. The two modules guide the training of adversarial examples in the pixel and frequency space, respectively. Experiments demonstrate that our method achieves high attack success rates surpassing the state-of-the-art methods, and exhibits strong transferability across different models.
CVApr 25
Transferable Physical-World Adversarial Patches Against Object Detection in Autonomous DrivingZihui Zhu, Ziqi Zhou, Yichen Wang et al.
Deep learning drives major advances in autonomous driving (AD), where object detectors are central to perception. However, adversarial attacks pose significant threats to the reliability and safety of these systems, with physical adversarial patches representing a particularly potent form of attack. Physical adversarial patch attacks pose severe risks but are usually crafted for a single model, yielding poor transferability to unseen detectors. We propose AdvAD, a transfer-based physical attack against object detection in autonomous driving. Instead of targeting a specific detector, AdvAD optimizes adversarial patches over multiple detection models in a unified framework, encouraging the learned perturbations to capture shared vulnerabilities across architectures. The optimization process adaptively balances model contributions and enforces robustness to physical variations. It further employs data augmentation and geometric transformations to maintain patch effectiveness under diverse physical conditions. Experiments in both digital and real-world settings show that AdvAD consistently outperforms state-of-the-art (SOTA) attacks in performance and transferability.
CVMar 19, 2025
Test-Time Backdoor Detection for Object Detection ModelsHangtao Zhang, Yichen Wang, Shihui Yan et al.
Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate prediction. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection -- particularly its output of numerous objects -- pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or "vanishing" object) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds. (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in objects confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
CVJan 21
Enhancing Text-to-Image Generation via End-Edge Collaborative Hybrid Super-ResolutionChongbin Yi, Yuxin Liang, Ziqi Zhou et al.
Artificial Intelligence-Generated Content (AIGC) has made significant strides, with high-resolution text-to-image (T2I) generation becoming increasingly critical for improving users' Quality of Experience (QoE). Although resource-constrained edge computing adequately supports fast low-resolution T2I generations, achieving high-resolution output still faces the challenge of ensuring image fidelity at the cost of latency. To address this, we first investigate the performance of super-resolution (SR) methods for image enhancement, confirming a fundamental trade-off that lightweight learning-based SR struggles to recover fine details, while diffusion-based SR achieves higher fidelity at a substantial computational cost. Motivated by these observations, we propose an end-edge collaborative generation-enhancement framework. Upon receiving a T2I generation task, the system first generates a low-resolution image based on adaptively selected denoising steps and super-resolution scales at the edge side, which is then partitioned into patches and processed by a region-aware hybrid SR policy. This policy applies a diffusion-based SR model to foreground patches for detail recovery and a lightweight learning-based SR model to background patches for efficient upscaling, ultimately stitching the enhanced ones into the high-resolution image. Experiments show that our system reduces service latency by 33% compared with baselines while maintaining competitive image quality.
CVOct 28, 2025
Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2Ziqi Zhou, Yifan Hu, Yufei Song et al.
Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.
CVOct 10, 2025
SegTrans: Transferable Adversarial Examples for Segmentation ModelsYufei Song, Ziqi Zhou, Qi Lu et al.
Segmentation models exhibit significant vulnerability to adversarial examples in white-box settings, but existing adversarial attack methods often show poor transferability across different segmentation models. While some researchers have explored transfer-based adversarial attack (i.e., transfer attack) methods for segmentation models, the complex contextual dependencies within these models and the feature distribution gaps between surrogate and target models result in unsatisfactory transfer success rates. To address these issues, we propose SegTrans, a novel transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples. These enhanced samples replace the original ones for perturbation optimization, thereby improving the transferability of adversarial examples across different segmentation models. Unlike existing methods, SegTrans only retains local semantic information from the original input, rather than using global semantic information to optimize perturbations. Extensive experiments on two benchmark datasets, PASCAL VOC and Cityscapes, four different segmentation models, and three backbone networks show that SegTrans significantly improves adversarial transfer success rates without introducing additional computational overhead. Compared to the current state-of-the-art methods, SegTrans achieves an average increase of 8.55% in transfer attack success rate and improves computational efficiency by more than 100%.
CVOct 9, 2025
DarkHash: A Data-Free Backdoor Attack Against Deep HashingZiqi Zhou, Menghao Deng, Yufei Song et al.
Benefiting from its superior feature learning capabilities and efficiency, deep hashing has achieved remarkable success in large-scale image retrieval. Recent studies have demonstrated the vulnerability of deep hashing models to backdoor attacks. Although these studies have shown promising attack results, they rely on access to the training dataset to implant the backdoor. In the real world, obtaining such data (e.g., identity information) is often prohibited due to privacy protection and intellectual property concerns. Embedding backdoors into deep hashing models without access to the training data, while maintaining retrieval accuracy for the original task, presents a novel and challenging problem. In this paper, we propose DarkHash, the first data-free backdoor attack against deep hashing. Specifically, we design a novel shadow backdoor attack framework with dual-semantic guidance. It embeds backdoor functionality and maintains original retrieval accuracy by fine-tuning only specific layers of the victim model using a surrogate dataset. We consider leveraging the relationship between individual samples and their neighbors to enhance backdoor attacks during training. By designing a topological alignment loss, we optimize both individual and neighboring poisoned samples toward the target sample, further enhancing the attack capability. Experimental results on four image datasets, five model architectures, and two hashing methods demonstrate the high effectiveness of DarkHash, outperforming existing state-of-the-art backdoor attack methods. Defense experiments show that DarkHash can withstand existing mainstream backdoor defense methods.
CRSep 21, 2025
MARS: A Malignity-Aware Backdoor Defense in Federated LearningWei Wan, Yuxuan Ning, Zhicong Huang et al.
Federated Learning (FL) is a distributed paradigm aimed at protecting participant data privacy by exchanging model parameters to achieve high-quality model training. However, this distributed nature also makes FL highly vulnerable to backdoor attacks. Notably, the recently proposed state-of-the-art (SOTA) attack, 3DFed (SP2023), uses an indicator mechanism to determine whether the backdoor models have been accepted by the defender and adaptively optimizes backdoor models, rendering existing defenses ineffective. In this paper, we first reveal that the failure of existing defenses lies in the employment of empirical statistical measures that are loosely coupled with backdoor attacks. Motivated by this, we propose a Malignity-Aware backdooR defenSe (MARS) that leverages backdoor energy (BE) to indicate the malicious extent of each neuron. To amplify malignity, we further extract the most prominent BE values from each model to form a concentrated backdoor energy (CBE). Finally, a novel Wasserstein distance-based clustering method is introduced to effectively identify backdoor models. Extensive experiments demonstrate that MARS can defend against SOTA backdoor attacks and significantly outperforms existing defenses.
CVSep 20, 2025
ADVEDM:Fine-grained Adversarial Attack against VLM-based Embodied AgentsYichen Wang, Hangtao Zhang, Hewen Pan et al.
Vision-Language Models (VLMs), with their strong reasoning and planning capabilities, are widely used in embodied decision-making (EDM) tasks in embodied agents, such as autonomous driving and robotic manipulation. Recent research has increasingly explored adversarial attacks on VLMs to reveal their vulnerabilities. However, these attacks either rely on overly strong assumptions, requiring full knowledge of the victim VLM, which is impractical for attacking VLM-based agents, or exhibit limited effectiveness. The latter stems from disrupting most semantic information in the image, which leads to a misalignment between the perception and the task context defined by system prompts. This inconsistency interrupts the VLM's reasoning process, resulting in invalid outputs that fail to affect interactions in the physical world. To this end, we propose a fine-grained adversarial attack framework, ADVEDM, which modifies the VLM's perception of only a few key objects while preserving the semantics of the remaining regions. This attack effectively reduces conflicts with the task context, making VLMs output valid but incorrect decisions and affecting the actions of agents, thus posing a more substantial safety threat in the physical world. We design two variants of based on this framework, ADVEDM-R and ADVEDM-A, which respectively remove the semantics of a specific object from the image and add the semantics of a new object into the image. The experimental results in both general scenarios and EDM tasks demonstrate fine-grained control and excellent attack performance.
CLOct 2, 2025
Say One Thing, Do Another? Diagnosing Reasoning-Execution Gaps in VLM-Powered Mobile-Use AgentsLingzhong Dong, Ziqi Zhou, Shuaibo Yang et al.
Mobile-use agents powered by vision-language models (VLMs) have shown great potential in interpreting natural language instructions and generating corresponding actions based on mobile graphical user interface. Recent studies suggest that incorporating chain-of-thought (CoT) reasoning tends to improve the execution accuracy. However, existing evaluations emphasize execution accuracy while neglecting whether CoT reasoning aligns with ground-truth actions. This oversight fails to assess potential reasoning-execution gaps, which in turn foster over-trust: users relying on seemingly plausible CoTs may unknowingly authorize harmful actions, potentially resulting in financial loss or trust crisis. In this work, we introduce a new evaluation framework to diagnose reasoning-execution gaps. At its core lies Ground-Truth Alignment (GTA), which measures whether the action implied by a CoT matches the ground-truth action. By combining GTA with the standard Exact Match (EM) metric, we jointly assess both the reasoning accuracy and execution accuracy. This joint perspective reveals two types of reasoning-execution gaps: (i) Execution Gap (EG), where the reasoning correctly identifies the correct action but execution fails, and (ii) Reasoning Gap (RG), where execution succeeds but reasoning process conflicts with the actual execution. Experimental results across a wide range of mobile interaction tasks reveal that reasoning-execution gaps are prevalent, with execution gaps occurring more frequently than reasoning gaps. Moreover, while scaling up model size reduces the overall gap, sizable execution gaps persist even in the largest models. Further analysis shows that our framework reliably reflects systematic EG/RG patterns in state-of-the-art models. These findings offer concrete diagnostics and support the development of more trustworthy mobile-use agents.
CVOct 1, 2025
Visual Self-Refinement for Autoregressive ModelsJiamian Wang, Ziqi Zhou, Chaithanya Kumar Mummadi et al.
Autoregressive models excel in sequential modeling and have proven to be effective for vision-language data. However, the spatial nature of visual signals conflicts with the sequential dependencies of next-token prediction, leading to suboptimal results. This work proposes a plug-and-play refinement module to enhance the complex spatial correspondence modeling within the generated visual sequence. This module operates as a post-pretraining step to jointly refine all generated tokens of autoregressive model, enhancing vision-language modeling under a shared sequential prediction framework. By leveraging global context and relationship across the tokens, our method mitigates the error accumulation issue within the sequential generation. Experiments demonstrate that the proposed method improves the generation quality, enhancing the model's ability to produce semantically consistent results.
CVSep 27, 2025
TATTOO: Training-free AesTheTic-aware Outfit recOmmendationYuntian Wu, Xiaonan Hu, Ziqi Zhou et al.
The global fashion e-commerce market relies significantly on intelligent and aesthetic-aware outfit-completion tools to promote sales. While previous studies have approached the problem of fashion outfit-completion and compatible-item retrieval, most of them require expensive, task-specific training on large-scale labeled data, and no effort is made to guide outfit recommendation with explicit human aesthetics. In the era of Multimodal Large Language Models (MLLMs), we show that the conventional training-based pipeline could be streamlined to a training-free paradigm, with better recommendation scores and enhanced aesthetic awareness. We achieve this with TATTOO, a Training-free AesTheTic-aware Outfit recommendation approach. It first generates a target-item description using MLLMs, followed by an aesthetic chain-of-thought used to distill the images into a structured aesthetic profile including color, style, occasion, season, material, and balance. By fusing the visual summary of the outfit with the textual description and aesthetics vectors using a dynamic entropy-gated mechanism, candidate items can be represented in a shared embedding space and be ranked accordingly. Experiments on a real-world evaluation set Aesthetic-100 show that TATTOO achieves state-of-the-art performance compared with existing training-based methods. Another standard Polyvore dataset is also used to measure the advanced zero-shot retrieval capability of our training-free method.
LGJun 18, 2025
Towards Reliable Forgetting: A Survey on Machine Unlearning VerificationLulu Xue, Shengshan Hu, Wei Lu et al.
With growing demands for privacy protection, security, and legal compliance (e.g., GDPR), machine unlearning has emerged as a critical technique for ensuring the controllability and regulatory alignment of machine learning models. However, a fundamental challenge in this field lies in effectively verifying whether unlearning operations have been successfully and thoroughly executed. Despite a growing body of work on unlearning techniques, verification methodologies remain comparatively underexplored and often fragmented. Existing approaches lack a unified taxonomy and a systematic framework for evaluation. To bridge this gap, this paper presents the first structured survey of machine unlearning verification methods. We propose a taxonomy that organizes current techniques into two principal categories -- behavioral verification and parametric verification -- based on the type of evidence used to assess unlearning fidelity. We examine representative methods within each category, analyze their underlying assumptions, strengths, and limitations, and identify potential vulnerabilities in practical deployment. In closing, we articulate a set of open problems in current verification research, aiming to provide a foundation for developing more robust, efficient, and theoretically grounded unlearning verification mechanisms.
CVDec 12, 2024
GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic ExpressionZiqi Zhou, Weize Quan, Hailin Shi et al.
Audio-driven talking head generation necessitates seamless integration of audio and visual data amidst the challenges posed by diverse input portraits and intricate correlations between audio and facial motions. In response, we propose a robust framework GoHD designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. GoHD innovates with three key modules: Firstly, an animation module utilizing latent navigation is introduced to improve the generalization ability across unseen input styles. This module achieves high disentanglement of motion and identity, and it also incorporates gaze orientation to rectify unnatural eye movements that were previously overlooked. Secondly, a conformer-structured conditional diffusion model is designed to guarantee head poses that are aware of prosody. Thirdly, to estimate lip-synchronized and realistic expressions from the input audio within limited training data, a two-stage training strategy is devised to decouple frequent and frame-wise lip motion distillation from the generation of other more temporally dependent but less audio-related motions, e.g., blinks and frowns. Extensive experiments validate GoHD's advanced generalization capabilities, demonstrating its effectiveness in generating realistic talking face results on arbitrary subjects.
CVApr 17, 2024
Detector Collapse: Physical-World Backdooring Object Detection to Catastrophic Overload or Blindness in Autonomous DrivingHangtao Zhang, Shengshan Hu, Yichen Wang et al.
Object detection tasks, crucial in safety-critical systems like autonomous driving, focus on pinpointing object locations. These detectors are known to be susceptible to backdoor attacks. However, existing backdoor techniques have primarily been adapted from classification tasks, overlooking deeper vulnerabilities specific to object detection. This paper is dedicated to bridging this gap by introducing Detector Collapse} (DC), a brand-new backdoor attack paradigm tailored for object detection. DC is designed to instantly incapacitate detectors (i.e., severely impairing detector's performance and culminating in a denial-of-service). To this end, we develop two innovative attack schemes: Sponge for triggering widespread misidentifications and Blinding for rendering objects invisible. Remarkably, we introduce a novel poisoning strategy exploiting natural objects, enabling DC to act as a practical backdoor in real-world environments. Our experiments on different detectors across several benchmarks show a significant improvement ($\sim$10\%-60\% absolute and $\sim$2-7$\times$ relative) in attack efficacy over state-of-the-art attacks.
CRFeb 8, 2022
Rainbow Differential PrivacyZiqi Zhou, Onur Günlü, Rafael G. L. D'Oliveira et al.
We extend a previous framework for designing differentially private (DP) mechanisms via randomized graph colorings that was restricted to binary functions, corresponding to colorings in a graph, to multi-valued functions. As before, datasets are nodes in the graph and any two neighboring datasets are connected by an edge. In our setting, we assume that each dataset has a preferential ordering for the possible outputs of the mechanism, each of which we refer to as a rainbow. Different rainbows partition the graph of datasets into different regions. We show that if the DP mechanism is pre-specified at the boundary of such regions and behaves identically for all same-rainbow boundary datasets, at most one optimal such mechanism can exist and the problem can be solved by means of a morphism to a line graph. We then show closed form expressions for the line graph in the case of ternary functions. Treatment of ternary queries in this paper displays enough richness to be extended to higher-dimensional query spaces with preferential query ordering, but the optimality proof does not seem to follow directly from the ternary proof.
HCDec 27, 2021
VibEmoji: Exploring User-authoring Multi-modal Emoticons in Social CommunicationPengcheng An, Ziqi Zhou, Qing Liu et al.
Emoticons are indispensable in online communications. With users' growing needs for more customized and expressive emoticons, recent messaging applications begin to support (limited) multi-modal emoticons: e.g., enhancing emoticons with animations or vibrotactile feedback. However, little empirical knowledge has been accumulated concerning how people create, share and experience multi-modal emoticons in everyday communication, and how to better support them through design. To tackle this, we developed VibEmoji, a user-authoring multi-modal emoticon interface for mobile messaging. Extending existing designs, VibEmoji grants users greater flexibility to combine various emoticons, vibrations, and animations on-the-fly, and offers non-aggressive recommendations based on these components' emotional relevance. Using VibEmoji as a probe, we conducted a four-week field study with 20 participants, to gain new understandings from in-the-wild usage and experience, and extract implications for design. We thereby contribute both a novel system and various insights for supporting users' creation and communication of multi-modal emoticons.
CVAug 27, 2021
Binocular Mutual Learning for Improving Few-shot ClassificationZiqi Zhou, Xi Qiu, Jiangtao Xie et al.
Most of the few-shot learning methods learn to transfer knowledge from datasets with abundant labeled data (i.e., the base set). From the perspective of class space on base set, existing methods either focus on utilizing all classes under a global view by normal pretraining, or pay more attention to adopt an episodic manner to train meta-tasks within few classes in a local view. However, the interaction of the two views is rarely explored. As the two views capture complementary information, we naturally think of the compatibility of them for achieving further performance gains. Inspired by the mutual learning paradigm and binocular parallax, we propose a unified framework, namely Binocular Mutual Learning (BML), which achieves the compatibility of the global view and the local view through both intra-view and cross-view modeling. Concretely, the global view learns in the whole class space to capture rich inter-class relationships. Meanwhile, the local view learns in the local class space within each episode, focusing on matching positive pairs correctly. In addition, cross-view mutual interaction further promotes the collaborative learning and the implicit exploration of useful knowledge from each other. During meta-test, binocular embeddings are aggregated together to support decision-making, which greatly improve the accuracy of classification. Extensive experiments conducted on multiple benchmarks including cross-domain validation confirm the effectiveness of our method.
CVApr 22, 2021
SoT: Delving Deeper into Classification Head for TransformerJiangtao Xie, Ruiren Zeng, Qilong Wang et al.
Transformer models are not only successful in natural language processing (NLP) but also demonstrate high potential in computer vision (CV). Despite great advance, most of works only focus on improvement of architectures but pay little attention to the classification head. For years transformer models base exclusively on classification token to construct the final classifier, without explicitly harnessing high-level word tokens. In this paper, we propose a novel transformer model called second-order transformer (SoT), exploiting simultaneously the classification token and word tokens for the classifier. Specifically, we empirically disclose that high-level word tokens contain rich information, which per se are very competent with the classifier and moreover, are complementary to the classification token. To effectively harness such rich information, we propose multi-headed global cross-covariance pooling with singular value power normalization, which shares similar philosophy and thus is compatible with the transformer block, better than commonly used pooling methods. Then, we study comprehensively how to explicitly combine word tokens with classification token for building the final classification head. For CV tasks, our SoT significantly improves state-of-the-art vision transformers on challenging benchmarks including ImageNet and ImageNet-A. For NLP tasks, through fine-tuning based on pretrained language transformers including GPT and BERT, our SoT greatly boosts the performance on widely used tasks such as CoLA and RTE. Code will be available at https://peihuali.org/SoT
HCFeb 3, 2021
InfoColorizer: Interactive Recommendation of Color Palettes for InfographicsLin-Ping Yuan, Ziqi Zhou, Jian Zhao et al.
When designing infographics, general users usually struggle with getting desired color palettes using existing infographic authoring tools, which sometimes sacrifice customizability, require design expertise, or neglect the influence of elements' spatial arrangement. We propose a data-driven method that provides flexibility by considering users' preferences, lowers the expertise barrier via automation, and tailors suggested palettes to the spatial layout of elements. We build a recommendation engine by utilizing deep learning techniques to characterize good color design practices from data, and further develop InfoColorizer, a tool that allows users to obtain color palettes for their infographics in an interactive and dynamic manner. To validate our method, we conducted a comprehensive four-part evaluation, including case studies, a controlled user study, a survey study, and an interview study. The results indicate that InfoColorizer can provide compelling palette recommendations with adequate flexibility, allowing users to effectively obtain high-quality color design for input infographics with low effort.
IVJan 18, 2021
Deep Symmetric Adaptation Network for Cross-modality Medical Image SegmentationXiaoting Han, Lei Qi, Qian Yu et al.
Unsupervised domain adaptation (UDA) methods have shown their promising performance in the cross-modality medical image segmentation tasks. These typical methods usually utilize a translation network to transform images from the source domain to target domain or train the pixel-level classifier merely using translated source images and original target images. However, when there exists a large domain shift between source and target domains, we argue that this asymmetric structure could not fully eliminate the domain gap. In this paper, we present a novel deep symmetric architecture of UDA for medical image segmentation, which consists of a segmentation sub-network, and two symmetric source and target domain translation sub-networks. To be specific, based on two translation sub-networks, we introduce a bidirectional alignment scheme via a shared encoder and private decoders to simultaneously align features 1) from source to target domain and 2) from target to source domain, which helps effectively mitigate the discrepancy between domains. Furthermore, for the segmentation sub-network, we train a pixel-level classifier using not only original target images and translated source images, but also original source images and translated target images, which helps sufficiently leverage the semantic information from the images with different styles. Extensive experiments demonstrate that our method has remarkable advantages compared to the state-of-the-art methods in both cross-modality Cardiac and BraTS segmentation tasks.
CRJun 10, 2020
A Novel Topology-Guided Attack and Its Countermeasure Towards Secure Logic LockingYuqiao Zhang, Ayush Jain, Pinchen Cui et al.
The outsourcing of the design and manufacturing of integrated circuits (ICs) in the current horizontal semiconductor integration flow has posed various security threats due to the presence of untrusted entities, such as overproduction of ICs, sale of out-of-specification/rejected ICs, and piracy of Intellectual Properties (IPs). Consequently, logic locking emerged as one of the prominent design for trust techniques. Unfortunately, these locking techniques are now inclined to achieve complete Boolean satisfiability (SAT) resiliency after the seminal work published in [47]. In this paper, we propose a novel oracle-less attack that is based on the topological analysis of the locked netlist even though it is SAT-resilient. The attack relies on identifying and constructing unit functions with a hypothesis key to be searched in the entire netlist to find its replica. The proposed graph search algorithm efficiently finds the duplicate functions in the netlist, making it a self-referencing attack. This proposed attack is extremely efficient and can determine the secret key within a few minutes. We have also proposed a countermeasure to make the circuit resilient against this topology-guided attack to progress towards a secure logic locking technique.
CRSep 16, 2019
TAAL: Tampering Attack on Any Key-based Logic Locked CircuitsAyush Jain, Ziqi Zhou, Ujjwal Guin
Due to the globalization of semiconductor manufacturing and test processes, the system-on-a-chip (SoC) designers no longer design the complete SoC and manufacture chips on their own. This outsourcing of the design and manufacturing of Integrated Circuits (ICs) has resulted in several threats, such as overproduction of ICs, sale of out-of-specification/rejected ICs, and piracy of Intellectual Properties (IPs). Logic locking has emerged as a promising defense strategy against these threats. However, various attacks about the extraction of secret keys have undermined the security of logic locking techniques. Over the years, researchers have proposed different techniques to prevent existing attacks. In this paper, we propose a novel attack that can break any logic locking techniques that rely on the stored secret key. This proposed TAAL attack is based on implanting a hardware Trojan in the netlist, which leaks the secret key to an adversary once activated. As an untrusted foundry can extract the netlist of a design from the layout/mask information, it is feasible to implement such a hardware Trojan. All three proposed types of TAAL attacks can be used for extracting secret keys. We have introduced the models for both the combinational and sequential hardware Trojans that evade manufacturing tests. An adversary only needs to choose one hardware Trojan out of a large set of all possible Trojans to launch the TAAL attack.
CVNov 19, 2018
Global and Local Sensitivity Guided Key Salient Object Re-augmentation for Video Saliency DetectionZiqi Zhou, Zheng Wang, Huchuan Lu et al.
The existing still-static deep learning based saliency researches do not consider the weighting and highlighting of extracted features from different layers, all features contribute equally to the final saliency decision-making. Such methods always evenly detect all "potentially significant regions" and unable to highlight the key salient object, resulting in detection failure of dynamic scenes. In this paper, based on the fact that salient areas in videos are relatively small and concentrated, we propose a \textbf{key salient object re-augmentation method (KSORA) using top-down semantic knowledge and bottom-up feature guidance} to improve detection accuracy in video scenes. KSORA includes two sub-modules (WFE and KOS): WFE processes local salient feature selection using bottom-up strategy, while KOS ranks each object in global fashion by top-down statistical knowledge, and chooses the most critical object area for local enhancement. The proposed KSORA can not only strengthen the saliency value of the local key salient object but also ensure global saliency consistency. Results on three benchmark datasets suggest that our model has the capability of improving the detection accuracy on complex scenes. The significant performance of KSORA, with a speed of 17FPS on modern GPUs, has been verified by comparisons with other ten state-of-the-art algorithms.