CLDec 18, 2025Code
Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language ModelsKai Hu, Abhinav Aggarwal, Mehran Khodabandeh et al.
This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-40 and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.
LGJul 7, 2021
Reconstructing Test Labels from Noisy Loss FunctionsAbhinav Aggarwal, Shiva Prasad Kasiviswanathan, Zekun Xu et al.
Machine learning classifiers rely on loss functions for performance evaluation, often on a private (hidden) dataset. In a recent line of research, label inference was introduced as the problem of reconstructing the ground truth labels of this private dataset from just the (possibly perturbed) cross-entropy loss function values evaluated at chosen prediction vectors (without any other access to the hidden dataset). In this paper, we formally study the necessary and sufficient conditions under which label inference is possible from \emph{any} (noisy) loss function value. Using tools from analytical number theory, we show that a broad class of commonly used loss functions, including general Bregman divergence-based losses and multiclass cross-entropy with common activation functions like sigmoid and softmax, it is possible to design label inference attacks that succeed even for arbitrary noise levels and using only a single query from the adversary. We formally study the computational complexity of label inference and show that while in general, designing adversarial prediction vectors for these attacks is co-NP-hard, once we have these vectors, the attacks can also be carried out through a lightweight augmentation to any neural network model, making them look benign and hard to detect. The observations in this paper provide a deeper understanding of the vulnerabilities inherent in modern machine learning and could be used for designing future trustworthy ML.
LGMay 18, 2021
Label Inference Attacks from Log-loss ScoresAbhinav Aggarwal, Shiva Prasad Kasiviswanathan, Zekun Xu et al.
Log-loss (also known as cross-entropy loss) metric is ubiquitously used across machine learning applications to assess the performance of classification algorithms. In this paper, we investigate the problem of inferring the labels of a dataset from single (or multiple) log-loss score(s), without any other access to the dataset. Surprisingly, we show that for any finite number of label classes, it is possible to accurately infer the labels of the dataset from the reported log-loss score of a single carefully constructed prediction vector if we allow arbitrary precision arithmetic. Additionally, we present label inference algorithms (attacks) that succeed even under addition of noise to the log-loss scores and under limited precision arithmetic. All our algorithms rely on ideas from number theory and combinatorics and require no model training. We run experimental simulations on some real datasets to demonstrate the ease of running these attacks in practice.
CLApr 23, 2021
On a Utilitarian Approach to Privacy Preserving Text GenerationZekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan et al.
Differentially-private mechanisms for text generation typically add carefully calibrated noise to input words and use the nearest neighbor to the noised input as the output word. When the noise is small in magnitude, these mechanisms are susceptible to reconstruction of the original sensitive text. This is because the nearest neighbor to the noised input is likely to be the original input. To mitigate this empirical privacy risk, we propose a novel class of differentially private mechanisms that parameterizes the nearest neighbor selection criterion in traditional mechanisms. Motivated by Vickrey auction, where only the second highest price is revealed and the highest price is kept private, we balance the choice between the first and the second nearest neighbors in the proposed class of mechanisms using a tuning parameter. This parameter is selected by empirically solving a constrained optimization problem for maximizing utility, while maintaining the desired privacy guarantees. We argue that this empirical measurement framework can be used to align different mechanisms along a common benchmark for their privacy-utility tradeoff, particularly when different distance metrics are used to calibrate the amount of noise added. Our experiments on real text classification datasets show up to 50% improvement in utility compared to the existing state-of-the-art with the same empirical privacy guarantee.
LGDec 10, 2020
Research Challenges in Designing Differentially Private Text Generation MechanismsOluwaseyi Feyisetan, Abhinav Aggarwal, Zekun Xu et al.
Accurately learning from user data while ensuring quantifiable privacy guarantees provides an opportunity to build better Machine Learning (ML) models while maintaining user trust. Recent literature has demonstrated the applicability of a generalized form of Differential Privacy to provide guarantees over text queries. Such mechanisms add privacy preserving noise to vectorial representations of text in high dimension and return a text based projection of the noisy vectors. However, these mechanisms are sub-optimal in their trade-off between privacy and utility. This is due to factors such as a fixed global sensitivity which leads to too much noise added in dense spaces while simultaneously guaranteeing protection for sensitive outliers. In this proposal paper, we describe some challenges in balancing the tradeoff between privacy and utility for these differentially private text mechanisms. At a high level, we provide two proposals: (1) a framework called LAC which defers some of the noise to a privacy amplification step and (2), an additional suite of three different techniques for calibrating the noise based on the local region around a word. Our objective in this paper is not to evaluate a single solution but to further the conversation on these challenges and chart pathways for building better mechanisms.
CLOct 22, 2020
A Differentially Private Text Perturbation Method Using a Regularized Mahalanobis MetricZekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan et al.
Balancing the privacy-utility tradeoff is a crucial requirement of many practical machine learning systems that deal with sensitive customer data. A popular approach for privacy-preserving text analysis is noise injection, in which text data is first mapped into a continuous embedding space, perturbed by sampling a spherical noise from an appropriate distribution, and then projected back to the discrete vocabulary space. While this allows the perturbation to admit the required metric differential privacy, often the utility of downstream tasks modeled on this perturbed data is low because the spherical noise does not account for the variability in the density around different words in the embedding space. In particular, words in a sparse region are likely unchanged even when the noise scale is large. %Using the global sensitivity of the mechanism can potentially add too much noise to the words in the dense regions of the embedding space, causing a high utility loss, whereas using local sensitivity can leak information through the scale of the noise added. In this paper, we propose a text perturbation mechanism based on a carefully designed regularized variant of the Mahalanobis metric to overcome this problem. For any given noise scale, this metric adds an elliptical noise to account for the covariance structure in the embedding space. This heterogeneity in the noise scale along different directions helps ensure that the words in the sparse region have sufficient likelihood of replacement without sacrificing the overall utility. We provide a text-perturbation algorithm based on this metric and formally prove its privacy guarantees. Additionally, we empirically show that our mechanism improves the privacy statistics to achieve the same level of utility as compared to the state-of-the-art Laplace mechanism.
LGSep 27, 2020
Differentially Private Adversarial Robustness Through Randomized PerturbationsNan Xu, Oluwaseyi Feyisetan, Abhinav Aggarwal et al.
Deep Neural Networks, despite their great success in diverse domains, are provably sensitive to small perturbations on correctly classified examples and lead to erroneous predictions. Recently, it was proposed that this behavior can be combatted by optimizing the worst case loss function over all possible substitutions of training examples. However, this can be prone to weighing unlikely substitutions higher, limiting the accuracy gain. In this paper, we study adversarial robustness through randomized perturbations, which has two immediate advantages: (1) by ensuring that substitution likelihood is weighted by the proximity to the original word, we circumvent optimizing the worst case guarantees and achieve performance gains; and (2) the calibrated randomness imparts differentially-private model training, which additionally improves robustness against adversarial attacks on the model outputs. Our approach uses a novel density-based mechanism based on truncated Gumbel noise, which ensures training on substitutions of both rare and dense words in the vocabulary while maintaining semantic similarity for model robustness.
LGSep 17, 2020
On Primes, Log-Loss Scores and (No) PrivacyAbhinav Aggarwal, Zekun Xu, Oluwaseyi Feyisetan et al.
Membership Inference Attacks exploit the vulnerabilities of exposing models trained on customer data to queries by an adversary. In a recently proposed implementation of an auditing tool for measuring privacy leakage from sensitive datasets, more refined aggregates like the Log-Loss scores are exposed for simulating inference attacks as well as to assess the total privacy leakage based on the adversary's predictions. In this paper, we prove that this additional information enables the adversary to infer the membership of any number of datapoints with full accuracy in a single query, causing complete membership privacy breach. Our approach obviates any attack model training or access to side knowledge with the adversary. Moreover, our algorithms are agnostic to the model under attack and hence, enable perfect membership inference even for models that do not memorize or overfit. In particular, our observations provide insight into the extent of information leakage from statistical aggregates and how they can be exploited.
DCSep 1, 2020
LoCUS: A multi-robot loss-tolerant algorithm for surveying volcanic plumesJohn Erickson, Abhinav Aggarwal, G. Matthew Fricke et al.
Measurement of volcanic CO2 flux by a drone swarm poses special challenges. Drones must be able to follow gas concentration gradients while tolerating frequent drone loss. We present the LoCUS algorithm as a solution to this problem and prove its robustness. LoCUS relies on swarm coordination and self-healing to solve the task. As a point of contrast we also implement the MoBS algorithm, derived from previously published work, which allows drones to solve the task independently. We compare the effectiveness of these algorithms using drone simulations, and find that LoCUS provides a reliable and efficient solution to the volcano survey problem. Further, the novel data-structures and algorithms underpinning LoCUS have application in other areas of fault-tolerant algorithm research.
LGAug 6, 2018
Paying Attention to Attention: Highlighting Influential Samples in Sequential AnalysisCynthia Freeman, Jonathan Merriman, Abhinav Aggarwal et al.
In (Yang et al. 2016), a hierarchical attention network (HAN) is created for document classification. The attention layer can be used to visualize text influential in classifying the document, thereby explaining the model's prediction. We successfully applied HAN to a sequential analysis task in the form of real-time monitoring of turn taking in conversations. However, we discovered instances where the attention weights were uniform at the stopping point (indicating all turns were equivalently influential to the classifier), preventing meaningful visualization for real-time human review or classifier improvement. We observed that attention weights for turns fluctuated as the conversations progressed, indicating turns had varying influence based on conversation state. Leveraging this observation, we develop a method to create more informative real-time visuals (as confirmed by human reviewers) in cases of uniform attention weights using the changes in turn importance as a conversation progresses over time.
CRSep 14, 2017
REMOTEGATE: Incentive-Compatible Remote Configuration of Security GatewaysAbhinav Aggarwal, Mahdi Zamani, Mihai Christodorescu
Imagine that a malicious hacker is trying to attack a server over the Internet and the server wants to block the attack packets as close to their point of origin as possible. However, the security gateway ahead of the source of attack is untrusted. How can the server block the attack packets through this gateway? In this paper, we introduce REMOTEGATE, a trustworthy mechanism for allowing any party (server) on the Internet to configure a security gateway owned by a second party, at a certain agreed upon reward that the former pays to the latter for its service. We take an interactive incentive-compatible approach, for the case when both the server and the gateway are rational, to devise a protocol that will allow the server to help the security gateway generate and deploy a policy rule that filters the attack packets before they reach the server. The server will reward the gateway only when the latter can successfully verify that it has generated and deployed the correct rule for the issue. This mechanism will enable an Internet-scale approach to improving security and privacy, backed by digital payment incentives.
CRDec 18, 2016
Distributed Computing with Channel NoiseAbhinav Aggarwal, Varsha Dani, Thomas P. Hayes et al.
A group of $n$ users want to run a distributed protocol $π$ over a network where communication occurs via private point-to-point channels. Unfortunately, an adversary, who knows $π$, is able to maliciously flip bits on the channels. Can we efficiently simulate $π$ in the presence of such an adversary? We show that this is possible, even when $L$, the number of bits sent in $π$, and $T$, the number of bits flipped by the adversary are not known in advance. In particular, we show how to create a robust version of $π$ that 1) fails with probability at most $δ$, for any $δ>0$; and 2) sends $\tilde{O}(L + T)$ bits, where the $\tilde{O}$ notation hides a $\log (nL/ δ)$ term multiplying $L$. Additionally, we show how to improve this result when the average message size $α$ is not constant. In particular, we give an algorithm that sends $O( L (1 + (1/α) \log (n L/δ) + T)$ bits. This algorithm is adaptive in that it does not require a priori knowledge of $α$. We note that if $α$ is $Ω\left( \log (n L/δ) \right)$, then this improved algorithm sends only $O(L+T)$ bits, and is therefore within a constant factor of optimal.
CRMay 15, 2016
Sending a Message with Unknown NoiseAbhinav Aggarwal, Varsha Dani, Thomas Hayes et al.
Alice and Bob are connected via a two-way channel, and Alice wants to send a message of $L$ bits to Bob. An adversary flips an arbitrary but finite number of bits, $T$, on the channel. This adversary knows our algorithm and Alice's message, but does not know any private random bits generated by Alice or Bob, nor the bits sent over the channel, except when these bits can be predicted by knowledge of Alice's message or our algorithm. We want Bob to receive Alice's message and for both players to terminate, with error probability at most $δ> 0$, where $δ$ is a parameter known to both Alice and Bob. Unfortunately, the value $T$ is unknown in advance to either Alice or Bob, and the value $L$ is unknown in advance to Bob. We describe an algorithm to solve the above problem while sending an expected $L + O \left( T + \min \left(T+1,\frac{L}{\log L} \right) \log \left( \frac{L}δ \right) \right)$ bits. A special case is when $δ= O(1/L^c)$, for some constant $c$. Then when $T = o(L/\log L)$, the expected number of bits sent is $L + o(L)$, and when $T = Ω(L)$, the expected number of bits sent is $L + O\left( T \right)$, which is asymptotically optimal.
CROct 9, 2013
Algebraic Message Authentication CodesAbhinav Aggarwal
This paper suggests a message authentication scheme, which can be efficiently used for secure digital signature creation. The algorithm used here is an adjusted union of the concepts which underlie projective geometry and group structure on circles. The authentication is done through a key, which iterates over the complete message string to produce the signature. The iteration is not only based on the frequency distribution of the message string alphabet, but also on the probability of occurrence of another given reference string in the message. The complete process can be easily computed in a small time, producing signatures which are highly dependent on the message string. Consequently, the odds in favor of existence of a forgery are highly reduced.