Jiyi Zhang

LG
h-index24
8papers
69citations
Novelty51%
AI Score27

8 Papers

CRDec 31, 2022
Tracing the Origin of Adversarial Attack for Forensic Investigation and Deterrence

Han Fang, Jiyi Zhang, Yupeng Qiu et al.

Deep neural networks are vulnerable to adversarial attacks. In this paper, we take the role of investigators who want to trace the attack and identify the source, that is, the particular model which the adversarial examples are generated from. Techniques derived would aid forensic investigation of attack incidents and serve as deterrence to potential attacks. We consider the buyers-seller setting where a machine learning model is to be distributed to various buyers and each buyer receives a slightly different copy with same functionality. A malicious buyer generates adversarial examples from a particular copy $\mathcal{M}_i$ and uses them to attack other copies. From these adversarial examples, the investigator wants to identify the source $\mathcal{M}_i$. To address this problem, we propose a two-stage separate-and-trace framework. The model separation stage generates multiple copies of a model for a same classification task. This process injects unique characteristics into each copy so that adversarial examples generated have distinct and traceable features. We give a parallel structure which embeds a ``tracer'' in each copy, and a noise-sensitive training loss to achieve this goal. The tracing stage takes in adversarial examples and a few candidate models, and identifies the likely source. Based on the unique features induced by the noise-sensitive loss function, we could effectively trace the potential adversarial copy by considering the output logits from each tracer. Empirical results show that it is possible to trace the origin of the adversarial example and the mechanism can be applied to a wide range of architectures and datasets.

LGJun 2, 2023
Adaptive Attractors: A Defense Strategy against ML Adversarial Collusion Attacks

Jiyi Zhang, Han Fang, Ee-Chien Chang

In the seller-buyer setting on machine learning models, the seller generates different copies based on the original model and distributes them to different buyers, such that adversarial samples generated on one buyer's copy would likely not work on other copies. A known approach achieves this using attractor-based rewriter which injects different attractors to different copies. This induces different adversarial regions in different copies, making adversarial samples generated on one copy not replicable on others. In this paper, we focus on a scenario where multiple malicious buyers collude to attack. We first give two formulations and conduct empirical studies to analyze effectiveness of collusion attack under different assumptions on the attacker's capabilities and properties of the attractors. We observe that existing attractor-based methods do not effectively mislead the colluders in the sense that adversarial samples found are influenced more by the original model instead of the attractors as number of colluders increases. Based on this observation, we propose using adaptive attractors whose weight is guided by a U-shape curve to cover the shortfalls. Experimentation results show that when using our approach, the attack success rate of a collusion attack converges to around 15% even when lots of copies are applied for collusion. In contrast, when using the existing attractor-based rewriter with fixed weight, the attack success rate increases linearly with the number of copies used for collusion.

CLFeb 21, 2024
Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs

Xiaoxia Li, Siyuan Liang, Jiyi Zhang et al.

Large Language Models (LLMs), used in creative writing, code generation, and translation, generate text based on input sequences but are vulnerable to jailbreak attacks, where crafted prompts induce harmful outputs. Most jailbreak prompt methods use a combination of jailbreak templates followed by questions to ask to create jailbreak prompts. However, existing jailbreak prompt designs generally suffer from excessive semantic differences, resulting in an inability to resist defenses that use simple semantic metrics as thresholds. Jailbreak prompts are semantically more varied than the original questions used for queries. In this paper, we introduce a Semantic Mirror Jailbreak (SMJ) approach that bypasses LLMs by generating jailbreak prompts that are semantically similar to the original question. We model the search for jailbreak prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standardized set of genetic algorithms for generating eligible prompts. Compared to the baseline AutoDAN-GA, SMJ achieves attack success rates (ASR) that are at most 35.4% higher without ONION defense and 85.2% higher with ONION defense. SMJ's better performance in all three semantic meaningfulness metrics of Jailbreak Prompt, Similarity, and Outlier, also means that SMJ is resistant to defenses that use those metrics as thresholds.

LGFeb 7, 2024
Domain Bridge: Generative model-based domain forensic for black-box models

Jiyi Zhang, Han Fang, Ee-Chien Chang

In forensic investigations of machine learning models, techniques that determine a model's data domain play an essential role, with prior work relying on large-scale corpora like ImageNet to approximate the target model's domain. Although such methods are effective in finding broad domains, they often struggle in identifying finer-grained classes within those domains. In this paper, we introduce an enhanced approach to determine not just the general data domain (e.g., human face) but also its specific attributes (e.g., wearing glasses). Our approach uses an image embedding model as the encoder and a generative model as the decoder. Beginning with a coarse-grained description, the decoder generates a set of images, which are then presented to the unknown target model. Successful classifications by the model guide the encoder to refine the description, which in turn, are used to produce a more specific set of images in the subsequent iteration. This iterative refinement narrows down the exact class of interest. A key strength of our approach lies in leveraging the expansive dataset, LAION-5B, on which the generative model Stable Diffusion is trained. This enlarges our search space beyond traditional corpora, such as ImageNet. Empirical results showcase our method's performance in identifying specific attributes of a model's input domain, paving the way for more detailed forensic analyses of deep learning models.

LGMay 10, 2023
Finding Meaningful Distributions of ML Black-boxes under Forensic Investigation

Jiyi Zhang, Han Fang, Hwee Kuan Lee et al.

Given a poorly documented neural network model, we take the perspective of a forensic investigator who wants to find out the model's data domain (e.g. whether on face images or traffic signs). Although existing methods such as membership inference and model inversion can be used to uncover some information about an unknown model, they still require knowledge of the data domain to start with. In this paper, we propose solving this problem by leveraging on comprehensive corpus such as ImageNet to select a meaningful distribution that is close to the original training distribution and leads to high performance in follow-up investigations. The corpus comprises two components, a large dataset of samples and meta information such as hierarchical structure and textual information on the samples. Our goal is to select a set of samples from the corpus for the given model. The core of our method is an objective function that considers two criteria on the selected samples: the model functional properties (derived from the dataset), and semantics (derived from the metadata). We also give an algorithm to efficiently search the large space of all possible subsets w.r.t. the objective function. Experimentation results show that the proposed method is effective. For example, cloning a given model (originally trained with CIFAR-10) by using Caltech 101 can achieve 45.5% accuracy. By using datasets selected by our method, the accuracy is improved to 72.0%.

CRNov 30, 2021
Mitigating Adversarial Attacks by Distributing Different Copies to Different Users

Jiyi Zhang, Han Fang, Wesley Joon-Wie Tann et al.

Machine learning models are vulnerable to adversarial attacks. In this paper, we consider the scenario where a model is distributed to multiple buyers, among which a malicious buyer attempts to attack another buyer. The malicious buyer probes its copy of the model to search for adversarial samples and then presents the found samples to the victim's copy of the model in order to replicate the attack. We point out that by distributing different copies of the model to different buyers, we can mitigate the attack such that adversarial samples found on one copy would not work on another copy. We observed that training a model with different randomness indeed mitigates such replication to a certain degree. However, there is no guarantee and retraining is computationally expensive. A number of works extended the retraining method to enhance the differences among models. However, a very limited number of models can be produced using such methods and the computational cost becomes even higher. Therefore, we propose a flexible parameter rewriting method that directly modifies the model's parameters. This method does not require additional training and is able to generate a large number of copies in a more controllable manner, where each copy induces different adversarial regions. Experimentation studies show that rewriting can significantly mitigate the attacks while retaining high classification accuracy. For instance, on GTSRB dataset with respect to Hop Skip Jump attack, using attractor-based rewriter can reduce the success rate of replicating the attack to 0.5% while independently training copies with different randomness can reduce the success rate to 6.5%. From this study, we believe that there are many further directions worth exploring.

CRMar 5, 2020
Confusing and Detecting ML Adversarial Attacks with Injected Attractors

Jiyi Zhang, Ee-Chien Chang, Hwee Kuan Lee

Many machine learning adversarial attacks find adversarial samples of a victim model ${\mathcal M}$ by following the gradient of some attack objective functions, either explicitly or implicitly. To confuse and detect such attacks, we take the proactive approach that modifies those functions with the goal of misleading the attacks to some local minimals, or to some designated regions that can be easily picked up by an analyzer. To achieve this goal, we propose adding a large number of artifacts, which we called $attractors$, onto the otherwise smooth function. An attractor is a point in the input space, where samples in its neighborhood have gradient pointing toward it. We observe that decoders of watermarking schemes exhibit properties of attractors and give a generic method that injects attractors from a watermark decoder into the victim model ${\mathcal M}$. This principled approach allows us to leverage on known watermarking schemes for scalability and robustness and provides explainability of the outcomes. Experimental studies show that our method has competitive performance. For instance, for un-targeted attacks on CIFAR-10 dataset, we can reduce the overall attack success rate of DeepFool to 1.9%, whereas known defense LID, FS and MagNet can reduce the rate to 90.8%, 98.5% and 78.5% respectively.

LGFeb 13, 2018
Flipped-Adversarial AutoEncoders

Jiyi Zhang, Hung Dang, Hwee Kuan Lee et al.

We propose a flipped-Adversarial AutoEncoder (FAAE) that simultaneously trains a generative model G that maps an arbitrary latent code distribution to a data distribution and an encoder E that embodies an "inverse mapping" that encodes a data sample into a latent code vector. Unlike previous hybrid approaches that leverage adversarial training criterion in constructing autoencoders, FAAE minimizes re-encoding errors in the latent space and exploits adversarial criterion in the data space. Experimental evaluations demonstrate that the proposed framework produces sharper reconstructed images while at the same time enabling inference that captures rich semantic representation of data.