Zhen Xiang

LG
h-index16
21papers
935citations
Novelty58%
AI Score46

21 Papers

6.6LGAug 8, 2023Code
Improved Activation Clipping for Universal Backdoor Mitigation and Test-Time Detection

Hang Wang, Zhen Xiang, David J. Miller et al.

Deep neural networks are vulnerable to backdoor attacks (Trojans), where an attacker poisons the training set with backdoor triggers so that the neural network learns to classify test-time triggers to the attacker's designated target class. Recent work shows that backdoor poisoning induces over-fitting (abnormally large activations) in the attacked model, which motivates a general, post-training clipping method for backdoor mitigation, i.e., with bounds on internal-layer activations learned using a small set of clean samples. We devise a new such approach, choosing the activation bounds to explicitly limit classification margins. This method gives superior performance against peer methods for CIFAR-10 image classification. We also show that this method has strong robustness against adaptive attacks, X2X attacks, and on different datasets. Finally, we demonstrate a method extension for test-time detection and correction based on the output differences between the original and activation-bounded networks. The code of our method is online available.

3.8LGAug 18, 2023
Backdoor Mitigation by Correcting the Distribution of Neural Activations

Xi Li, Zhen Xiang, David J. Miller et al.

Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever the attacker's backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is corrected. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, but achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities in the act of exploiting the backdoor.

33.4CLFeb 19, 2024Code
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Fengqing Jiang, Zhangchen Xu, Luyao Niu et al. · uw

Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs. Our code is available at https://github.com/uw-nsl/ArtPrompt.

14.9LGOct 26, 2023Code
CBD: A Certified Backdoor Detector Based on Local Dominant Probability

Zhen Xiang, Zidi Xiong, Bo Li

Backdoor attack is a common threat to deep neural networks. During testing, samples embedded with a backdoor trigger will be misclassified as an adversarial target by a backdoored model, while samples without the backdoor trigger will be correctly classified. In this paper, we present the first certified backdoor detector (CBD), which is based on a novel, adjustable conformal prediction scheme based on our proposed statistic local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which the attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound for the false positive rate. Our theoretical results show that attacks with triggers that are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves comparable or even higher detection accuracy than state-of-the-art detectors, and it in addition provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ which achieves more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.

8.6CRApr 22, 2025Code
Large Language Model Empowered Privacy-Protected Framework for PHI Annotation in Clinical Notes

Guanchen Wu, Linzhi Zheng, Han Xie et al.

The de-identification of private information in medical data is a crucial process to mitigate the risk of confidentiality breaches, particularly when patient personal details are not adequately removed before the release of medical records. Although rule-based and learning-based methods have been proposed, they often struggle with limited generalizability and require substantial amounts of annotated data for effective performance. Recent advancements in large language models (LLMs) have shown significant promise in addressing these issues due to their superior language comprehension capabilities. However, LLMs present challenges, including potential privacy risks when using commercial LLM APIs and high computational costs for deploying open-source LLMs locally. In this work, we introduce LPPA, an LLM-empowered Privacy-Protected PHI Annotation framework for clinical notes, targeting the English language. By fine-tuning LLMs locally with synthetic notes, LPPA ensures strong privacy protection and high PHI annotation accuracy. Extensive experiments demonstrate LPPA's effectiveness in accurately de-identifying private information, offering a scalable and efficient solution for enhancing patient privacy protection.

39.9AIFeb 17, 2025
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

Fengqing Jiang, Zhangchen Xu, Yuetai Li et al. · uw

Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.

36.0CRFeb 17, 2025
Unveiling Privacy Risks in LLM Agent Memory

Bo Wang, Weiyi He, Shenglai Zeng et al.

Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent designer's and the attacker's perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.

12.5CRDec 9, 2024Code
Data Free Backdoor Attacks

Bochuan Cao, Jinyuan Jia, Chuxuan Hu et al.

Backdoor attacks aim to inject a backdoor into a classifier such that it predicts any input with an attacker-chosen backdoor trigger as an attacker-chosen target class. Existing backdoor attacks require either retraining the classifier with some clean data or modifying the model's architecture. As a result, they are 1) not applicable when clean data is unavailable, 2) less efficient when the model is large, and 3) less stealthy due to architecture changes. In this work, we propose DFBA, a novel retraining-free and data-free backdoor attack without changing the model architecture. Technically, our proposed method modifies a few parameters of a classifier to inject a backdoor. Through theoretical analysis, we verify that our injected backdoor is provably undetectable and unremovable by various state-of-the-art defenses under mild assumptions. Our evaluation on multiple datasets further demonstrates that our injected backdoor: 1) incurs negligible classification loss, 2) achieves 100% attack success rates, and 3) bypasses six existing state-of-the-art defenses. Moreover, our comparison with a state-of-the-art non-data-free backdoor attack shows our attack is more stealthy and effective against various defenses while achieving less classification accuracy loss.

4.9CLAug 2, 2025
Adaptive Content Restriction for Large Language Models via Suffix Optimization

Yige Li, Peihai Jiang, Jun Sun et al.

Large Language Models (LLMs) have demonstrated significant success across diverse applications. However, enforcing content restrictions remains a significant challenge due to their expansive output space. One aspect of content restriction is preventing LLMs from generating harmful content via model alignment approaches such as supervised fine-tuning (SFT). Yet, the need for content restriction may vary significantly across user groups, change rapidly over time, and not always align with general definitions of harmfulness. Applying SFT to each of these specific use cases is impractical due to the high computational, data, and storage demands. Motivated by this need, we propose a new task called \textit{Adaptive Content Restriction} (AdaCoRe), which focuses on lightweight strategies -- methods without model fine-tuning -- to prevent deployed LLMs from generating restricted terms for specific use cases. We propose the first method for AdaCoRe, named \textit{Suffix Optimization (SOP)}, which appends a short, optimized suffix to any prompt to a) prevent a target LLM from generating a set of restricted terms, while b) preserving the output quality. To evaluate AdaCoRe approaches, including our SOP, we create a new \textit{Content Restriction Benchmark} (CoReBench), which contains 400 prompts for 80 restricted terms across 8 carefully selected categories. We demonstrate the effectiveness of SOP on CoReBench, which outperforms the system-level baselines such as system suffix by 15\%, 17\%, 10\%, 9\%, and 6\% on average restriction rates for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. We also demonstrate that SOP is effective on POE, an online platform hosting various commercial LLMs, highlighting its practicality in real-world scenarios.

34.4CRJan 20, 2024Code
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Zhen Xiang, Fengqing Jiang, Zidi Xiong et al.

Large language models (LLMs) are shown to benefit from chain-of-thought (COT) prompting, particularly when tackling tasks that require systematic reasoning processes. On the other hand, COT prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. Traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. However, these approaches are not practical for commercial LLMs that typically operate via API access. In this paper, we propose BadChain, the first backdoor attack against LLMs employing COT prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. BadChain leverages the inherent reasoning capabilities of LLMs by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. Empirically, we show the effectiveness of BadChain for two COT strategies across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on GPT-4. Finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against BadChain. Therefore, BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.

15.5LGMay 29, 2023
UMD: Unsupervised Model Detection for X2X Backdoor Attacks

Zhen Xiang, Zidi Xiong, Bo Li

Backdoor (Trojan) attack is a common threat to deep neural networks, where samples from one or more source classes embedded with a backdoor trigger will be misclassified to adversarial target classes. Existing methods for detecting whether a classifier is backdoor attacked are mostly designed for attacks with a single adversarial target (e.g., all-to-one attack). To the best of our knowledge, without supervision, no existing methods can effectively address the more general X2X attack with an arbitrary number of source classes, each paired with an arbitrary target class. In this paper, we propose UMD, the first Unsupervised Model Detection method that effectively detects X2X backdoor attacks via a joint inference of the adversarial (source, target) class pairs. In particular, we first define a novel transferability statistic to measure and select a subset of putative backdoor class pairs based on a proposed clustering approach. Then, these selected class pairs are jointly assessed based on an aggregation of their reverse-engineered trigger size for detection inference, using a robust and unsupervised anomaly detector we proposed. We conduct comprehensive evaluations on CIFAR-10, GTSRB, and Imagenette dataset, and show that our unsupervised UMD outperforms SOTA detectors (even with supervision) by 17%, 4%, and 8%, respectively, in terms of the detection accuracy against diverse X2X attacks. We also show the strong detection performance of UMD against several strong adaptive attacks.

17.0CRDec 6, 2021
Test-Time Detection of Backdoor Triggers for Poisoned Deep Neural Networks

Xi Li, Zhen Xiang, David J. Miller et al.

Backdoor (Trojan) attacks are emerging threats against deep neural networks (DNN). A DNN being attacked will predict to an attacker-desired target class whenever a test sample from any source class is embedded with a backdoor pattern; while correctly classifying clean (attack-free) test samples. Existing backdoor defenses have shown success in detecting whether a DNN is attacked and in reverse-engineering the backdoor pattern in a "post-training" regime: the defender has access to the DNN to be inspected and a small, clean dataset collected independently, but has no access to the (possibly poisoned) training set of the DNN. However, these defenses neither catch culprits in the act of triggering the backdoor mapping, nor mitigate the backdoor attack at test-time. In this paper, we propose an "in-flight" defense against backdoor attacks on image classification that 1) detects use of a backdoor trigger at test-time; and 2) infers the class of origin (source class) for a detected trigger example. The effectiveness of our defense is demonstrated experimentally against different strong backdoor attacks.

1.6LGMay 28, 2021
A BIC-based Mixture Model Defense against Data Poisoning Attacks on Classifiers

Xi Li, David J. Miller, Zhen Xiang et al.

Data Poisoning (DP) is an effective attack that causes trained classifiers to misclassify their inputs. DP attacks significantly degrade a classifier's accuracy by covertly injecting attack samples into the training set. Broadly applicable to different classifier structures, without strong assumptions about the attacker, an {\it unsupervised} Bayesian Information Criterion (BIC)-based mixture model defense against "error generic" DP attacks is herein proposed that: 1) addresses the most challenging {\it embedded} DP scenario wherein, if DP is present, the poisoned samples are an {\it a priori} unknown subset of the training set, and with no clean validation set available; 2) applies a mixture model both to well-fit potentially multi-modal class distributions and to capture poisoned samples within a small subset of the mixture components; 3) jointly identifies poisoned components and samples by minimizing the BIC cost defined over the whole training set, with the identified poisoned data removed prior to classifier training. Our experimental results, for various classifier structures and benchmark datasets, demonstrate the effectiveness and universality of our defense under strong DP attacks, as well as its superiority over other works.

32.3CRApr 12, 2021
A Backdoor Attack against 3D Point Cloud Classifiers

Zhen Xiang, David J. Miller, Siheng Chen et al.

Vulnerability of 3D point cloud (PC) classifiers has become a grave concern due to the popularity of 3D sensors in safety-critical applications. Existing adversarial attacks against 3D PC classifiers are all test-time evasion (TTE) attacks that aim to induce test-time misclassifications using knowledge of the classifier. But since the victim classifier is usually not accessible to the attacker, the threat is largely diminished in practice, as PC TTEs typically have poor transferability. Here, we propose the first backdoor attack (BA) against PC classifiers. Originally proposed for images, BAs poison the victim classifier's training set so that the classifier learns to decide to the attacker's target class whenever the attacker's backdoor pattern is present in a given input sample. Significantly, BAs do not require knowledge of the victim classifier. Different from image BAs, we propose to insert a cluster of points into a PC as a robust backdoor pattern customized for 3D PCs. Such clusters are also consistent with a physical attack (i.e., with a captured object in a scene). We optimize the cluster's location using an independently trained surrogate classifier and choose the cluster's local geometry to evade possible PC preprocessing and PC anomaly detectors (ADs). Experimentally, our BA achieves a uniformly high success rate (> 87%) and shows evasiveness against state-of-the-art PC ADs.

6.5CVOct 20, 2020
L-RED: Efficient Post-Training Detection of Imperceptible Backdoor Attacks without Access to the Training Set

Zhen Xiang, David J. Miller, George Kesidis

Backdoor attacks (BAs) are an emerging form of adversarial attack typically against deep neural network image classifiers. The attacker aims to have the classifier learn to classify to a target class when test images from one or more source classes contain a backdoor pattern, while maintaining high accuracy on all clean test images. Reverse-Engineering-based Defenses (REDs) against BAs do not require access to the training set but only to an independent clean dataset. Unfortunately, most existing REDs rely on an unrealistic assumption that all classes except the target class are source classes of the attack. REDs that do not rely on this assumption often require a large set of clean images and heavy computation. In this paper, we propose a Lagrangian-based RED (L-RED) that does not require knowledge of the number of source classes (or whether an attack is present). Our defense requires very few clean images to effectively detect BAs and is computationally efficient. Notably, we detect 56 out of 60 BAs using only two clean images per class in our experiments on CIFAR-10.

10.1LGOct 15, 2020
Reverse Engineering Imperceptible Backdoor Attacks on Deep Neural Networks for Detection and Training Set Cleansing

Zhen Xiang, David J. Miller, George Kesidis

Backdoor data poisoning is an emerging form of adversarial attack usually against deep neural network image classifiers. The attacker poisons the training set with a relatively small set of images from one (or several) source class(es), embedded with a backdoor pattern and labeled to a target class. For a successful attack, during operation, the trained classifier will: 1) misclassify a test image from the source class(es) to the target class whenever the same backdoor pattern is present; 2) maintain a high classification accuracy for backdoor-free test images. In this paper, we make a break-through in defending backdoor attacks with imperceptible backdoor patterns (e.g. watermarks) before/during the training phase. This is a challenging problem because it is a priori unknown which subset (if any) of the training set has been poisoned. We propose an optimization-based reverse-engineering defense, that jointly: 1) detects whether the training set is poisoned; 2) if so, identifies the target class and the training images with the backdoor pattern embedded; and 3) additionally, reversely engineers an estimate of the backdoor pattern used by the attacker. In benchmark experiments on CIFAR-10, for a large variety of attacks, our defense achieves a new state-of-the-art by reducing the attack success rate to no more than 4.9% after removing detected suspicious training images.

8.6LGNov 18, 2019
Revealing Perceptible Backdoors, without the Training Set, via the Maximum Achievable Misclassification Fraction Statistic

Zhen Xiang, David J. Miller, Hang Wang et al.

Recently, a backdoor data poisoning attack was proposed, which adds mislabeled examples to the training set, with an embedded backdoor pattern, aiming to have the classifier learn to classify to a target class whenever the backdoor pattern is present in a test sample. Here, we address post-training detection of innocuous perceptible backdoors in DNN image classifiers, wherein the defender does not have access to the poisoned training set, but only to the trained classifier, as well as unpoisoned examples. This problem is challenging because without the poisoned training set, we have no hint about the actual backdoor pattern used during training. This post-training scenario is also of great import because in many practical contexts the DNN user did not train the DNN and does not have access to the training data. We identify two important properties of perceptible backdoor patterns - spatial invariance and robustness - based upon which we propose a novel detector using the maximum achievable misclassification fraction (MAMF) statistic. We detect whether the trained DNN has been backdoor-attacked and infer the source and target classes. Our detector outperforms other existing detectors and, coupled with an imperceptible backdoor detector, helps achieve post-training detection of all evasive backdoors.

9.9LGApr 12, 2019
Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks

David J. Miller, Zhen Xiang, George Kesidis

There is great potential for damage from adversarial learning (AL) attacks on machine-learning based systems. In this paper, we provide a contemporary survey of AL, focused particularly on defenses against attacks on statistical classifiers. After introducing relevant terminology and the goals and range of possible knowledge of both attackers and defenders, we survey recent work on test-time evasion (TTE), data poisoning (DP), and reverse engineering (RE) attacks and particularly defenses against same. In so doing, we distinguish robust classification from anomaly detection (AD), unsupervised from supervised, and statistical hypothesis-based defenses from ones that do not have an explicit null (no attack) hypothesis; we identify the hyperparameters a particular method requires, its computational complexity, as well as the performance measures on which it was evaluated and the obtained quality. We then dig deeper, providing novel insights that challenge conventional AL wisdom and that target unresolved issues, including: 1) robust classification versus AD as a defense strategy; 2) the belief that attack success increases with attack strength, which ignores susceptibility to AD; 3) small perturbations for test-time evasion attacks: a fallacy or a requirement?; 4) validity of the universal assumption that a TTE attacker knows the ground-truth class for the example to be attacked; 5) black, grey, or white box attacks as the standard for defense evaluation; 6) susceptibility of query-based RE to an AD defense. We also discuss attacks on the privacy of training data. We then present benchmark comparisons of several defenses against TTE, RE, and backdoor DP attacks on images. The paper concludes with a discussion of future work.

4.2CROct 31, 2018
A Mixture Model Based Defense for Data Poisoning Attacks Against Naive Bayes Spam Filters

David J. Miller, Xinyi Hu, Zhen Xiang et al.

Naive Bayes spam filters are highly susceptible to data poisoning attacks. Here, known spam sources/blacklisted IPs exploit the fact that their received emails will be treated as (ground truth) labeled spam examples, and used for classifier training (or re-training). The attacking source thus generates emails that will skew the spam model, potentially resulting in great degradation in classifier accuracy. Such attacks are successful mainly because of the poor representation power of the naive Bayes (NB) model, with only a single (component) density to represent spam (plus a possible attack). We propose a defense based on the use of a mixture of NB models. We demonstrate that the learned mixture almost completely isolates the attack in a second NB component, with the original spam component essentially unchanged by the attack. Our approach addresses both the scenario where the classifier is being re-trained in light of new data and, significantly, the more challenging scenario where the attack is embedded in the original spam training set. Even for weak attack strengths, BIC-based model order selection chooses a two-component solution, which invokes the mixture-based defense. Promising results are presented on the TREC 2005 spam corpus.

20.9LGMay 19, 2014
Screening Tests for Lasso Problems

Zhen James Xiang, Yun Wang, Peter J. Ramadge

This paper is a survey of dictionary screening for the lasso problem. The lasso problem seeks a sparse linear combination of the columns of a dictionary to best match a given target vector. This sparse representation has proven useful in a variety of subsequent processing and decision tasks. For a given target vector, dictionary screening quickly identifies a subset of dictionary columns that will receive zero weight in a solution of the corresponding lasso problem. These columns can be removed from the dictionary prior to solving the lasso problem without impacting the optimality of the solution obtained. This has two potential advantages: it reduces the size of the dictionary, allowing the lasso problem to be solved with less resources, and it may speed up obtaining a solution. Using a geometrically intuitive framework, we provide basic insights for understanding useful lasso screening tests and their limitations. We also provide illustrative numerical studies on several datasets.