Saeed Mahloujifar

LG
h-index51
46papers
1,766citations
Novelty60%
AI Score58

46 Papers

LGApr 28, 2022Code
Formulating Robustness Against Unforeseen Attacks

Sihui Dai, Saeed Mahloujifar, Prateek Mittal

Existing defenses against adversarial examples such as adversarial training typically assume that the adversary will conform to a specific or known threat model, such as $\ell_p$ perturbations within a fixed budget. In this paper, we focus on the scenario where there is a mismatch in the threat model assumed by the defense during training, and the actual capabilities of the adversary at test time. We ask the question: if the learner trains against a specific "source" threat model, when can we expect robustness to generalize to a stronger unknown "target" threat model during test-time? Our key contribution is to formally define the problem of learning and generalization with an unforeseen adversary, which helps us reason about the increase in adversarial risk from the conventional perspective of a known adversary. Applying our framework, we derive a generalization bound which relates the generalization gap between source and target threat models to variation of the feature extractor, which measures the expected maximum difference between extracted features across a given threat model. Based on our generalization bound, we propose variation regularization (VR) which reduces variation of the feature extractor across the source threat model during training. We empirically demonstrate that using VR can lead to improved generalization to unforeseen attacks during test-time, and combining VR with perceptual adversarial training (Laidlaw et al., 2021) achieves state-of-the-art robustness on unforeseen attacks. Our code is publicly available at https://github.com/inspire-group/variation-regularization.

CRApr 12, 2022
Optimal Membership Inference Bounds for Adaptive Composition of Sampled Gaussian Mechanisms

Saeed Mahloujifar, Alexandre Sablayrolles, Graham Cormode et al. · oxford

Given a trained model and a data sample, membership-inference (MI) attacks predict whether the sample was in the model's training set. A common countermeasure against MI attacks is to utilize differential privacy (DP) during model training to mask the presence of individual examples. While this use of DP is a principled approach to limit the efficacy of MI attacks, there is a gap between the bounds provided by DP and the empirical performance of MI attacks. In this paper, we derive bounds for the \textit{advantage} of an adversary mounting a MI attack, and demonstrate tightness for the widely-used Gaussian mechanism. We further show bounds on the \textit{confidence} of MI attacks. Our bounds are much stronger than those obtained by DP analysis. For example, analyzing a setting of DP-SGD with $ε=4$ would obtain an upper bound on the advantage of $\approx0.36$ based on our analyses, while getting bound of $\approx 0.97$ using the analysis of previous work that convert $ε$ to membership inference bounds. Finally, using our analysis, we provide MI metrics for models trained on CIFAR10 dataset. To the best of our knowledge, our analysis provides the state-of-the-art membership inference bounds for the privacy.

CVJul 22, 2022
Just Rotate it: Deploying Backdoor Attacks via Rotation Transformation

Tong Wu, Tianhao Wang, Vikash Sehwag et al. · princeton

Recent works have demonstrated that deep learning models are vulnerable to backdoor poisoning attacks, where these attacks instill spurious correlations to external trigger patterns or objects (e.g., stickers, sunglasses, etc.). We find that such external trigger signals are unnecessary, as highly effective backdoors can be easily inserted using rotation-based image transformation. Our method constructs the poisoned dataset by rotating a limited amount of objects and labeling them incorrectly; once trained with it, the victim's model will make undesirable predictions during run-time inference. It exhibits a significantly high attack success rate while maintaining clean performance through comprehensive empirical studies on image classification and object detection tasks. Furthermore, we evaluate standard data augmentation techniques and four different backdoor defenses against our attack and find that none of them can serve as a consistent mitigation approach. Our attack can be easily deployed in the real world since it only requires rotating the object, as we show in both image classification and object detection applications. Overall, our work highlights a new, simple, physically realizable, and highly effective vector for backdoor attacks. Our video demo is available at https://youtu.be/6JIF8wnX34M.

LGDec 8, 2022
A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization

Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar et al. · princeton

An open problem in differentially private deep learning is hyperparameter optimization (HPO). DP-SGD introduces new hyperparameters and complicates existing ones, forcing researchers to painstakingly tune hyperparameters with hundreds of trials, which in turn makes it impossible to account for the privacy cost of HPO without destroying the utility. We propose an adaptive HPO method that uses cheap trials (in terms of privacy cost and runtime) to estimate optimal hyperparameters and scales them up. We obtain state-of-the-art performance on 22 benchmark tasks, across computer vision and natural language processing, across pretraining and finetuning, across architectures and a wide range of $\varepsilon \in [0.01,8.0]$, all while accounting for the privacy cost of HPO.

LGJan 29, 2023
Uncovering Adversarial Risks of Test-Time Adaptation

Tong Wu, Feiran Jia, Xiangyu Qi et al. · princeton

Recently, test-time adaptation (TTA) has been proposed as a promising solution for addressing distribution shifts. It allows a base model to adapt to an unforeseen distribution during inference by leveraging the information from the batch of (unlabeled) test data. However, we uncover a novel security vulnerability of TTA based on the insight that predictions on benign samples can be impacted by malicious samples in the same batch. To exploit this vulnerability, we propose Distribution Invading Attack (DIA), which injects a small fraction of malicious data into the test batch. DIA causes models using TTA to misclassify benign and unperturbed test data, providing an entirely new capability for adversaries that is infeasible in canonical machine learning pipelines. Through comprehensive evaluations, we demonstrate the high effectiveness of our attack on multiple benchmarks across six TTA methods. In response, we investigate two countermeasures to robustify the existing insecure TTA implementations, following the principle of "security by design". Together, we hope our findings can make the community aware of the utility-security tradeoffs in deploying TTA and provide valuable insights for developing robust TTA approaches.

LGFeb 21, 2023
MultiRobustBench: Benchmarking Robustness Against Multiple Attacks

Sihui Dai, Saeed Mahloujifar, Chong Xiang et al. · princeton

The bulk of existing research in defending against adversarial examples focuses on defending against a single (typically bounded Lp-norm) attack, but for a practical setting, machine learning (ML) models should be robust to a wide variety of attacks. In this paper, we present the first unified framework for considering multiple attacks against ML models. Our framework is able to model different levels of learner's knowledge about the test-time adversary, allowing us to model robustness against unforeseen attacks and robustness against unions of attacks. Using our framework, we present the first leaderboard, MultiRobustBench, for benchmarking multiattack evaluation which captures performance across attack types and attack strengths. We evaluate the performance of 16 defended models for robustness against a set of 9 different attack types, including Lp-based threat models, spatial transformations, and color changes, at 20 different attack strengths (180 attacks total). Additionally, we analyze the state of current defenses against multiple attacks. Our analysis shows that while existing defenses have made progress in terms of average robustness across the set of attacks used, robustness against the worst-case attack is still a big open problem as all existing models perform worse than random guessing.

LGMay 26, 2022
Towards A Proactive ML Approach for Detecting Backdoor Poison Samples

Xiangyu Qi, Tinghao Xie, Jiachen T. Wang et al.

Adversaries can embed backdoors in deep learning models by introducing backdoor poison samples into training datasets. In this work, we investigate how to detect such poison samples to mitigate the threat of backdoor attacks. First, we uncover a post-hoc workflow underlying most prior work, where defenders passively allow the attack to proceed and then leverage the characteristics of the post-attacked model to uncover poison samples. We reveal that this workflow does not fully exploit defenders' capabilities, and defense pipelines built on it are prone to failure or performance degradation in many scenarios. Second, we suggest a paradigm shift by promoting a proactive mindset in which defenders engage proactively with the entire model training and poison detection pipeline, directly enforcing and magnifying distinctive characteristics of the post-attacked model to facilitate poison detection. Based on this, we formulate a unified framework and provide practical insights on designing detection pipelines that are more robust and generalizable. Third, we introduce the technique of Confusion Training (CT) as a concrete instantiation of our framework. CT applies an additional poisoning attack to the already poisoned dataset, actively decoupling benign correlation while exposing backdoor patterns to detection. Empirical evaluations on 4 datasets and 14 types of attacks validate the superiority of CT over 14 baseline defenses.

LGOct 27, 2023
Publicly-Detectable Watermarking for Language Models

Jaiden Fairoze, Sanjam Garg, Somesh Jha et al.

We present a publicly-detectable watermarking scheme for LMs: the detection algorithm contains no secret information, and it is executable by anyone. We embed a publicly-verifiable cryptographic signature into LM output using rejection sampling and prove that this produces unforgeable and distortion-free (i.e., undetectable without access to the public key) text output. We make use of error-correction to overcome periods of low entropy, a barrier for all prior watermarking schemes. We implement our scheme and find that our formal claims are met in practice.

CRFeb 14, 2023
Bounding Training Data Reconstruction in DP-SGD

Jamie Hayes, Saeed Mahloujifar, Borja Balle

Differentially private training offers a protection which is usually interpreted as a guarantee against membership inference attacks. By proxy, this guarantee extends to other threats like reconstruction attacks attempting to extract complete training examples. Recent works provide evidence that if one does not need to protect against membership attacks but instead only wants to protect against training data reconstruction, then utility of private models can be improved because less noise is required to protect against these more ambitious attacks. We investigate this further in the context of DP-SGD, a standard algorithm for private deep learning, and provide an upper bound on the success of any reconstruction attack against DP-SGD together with an attack that empirically matches the predictions of our bound. Together, these two results open the door to fine-grained investigations on how to set the privacy parameters of DP-SGD in practice to protect against reconstruction attacks. Finally, we use our methods to demonstrate that different settings of the DP-SGD parameters leading to the same DP guarantees can result in significantly different success rates for reconstruction, indicating that the DP guarantee alone might not be a good proxy for controlling the protection against reconstruction attacks.

CRApr 17, 2023
A Randomized Approach for Tight Privacy Accounting

Jiachen T. Wang, Saeed Mahloujifar, Tong Wu et al.

Bounding privacy leakage over compositions, i.e., privacy accounting, is a key challenge in differential privacy (DP). The privacy parameter ($\eps$ or $δ$) is often easy to estimate but hard to bound. In this paper, we propose a new differential privacy paradigm called estimate-verify-release (EVR), which addresses the challenges of providing a strict upper bound for privacy parameter in DP compositions by converting an estimate of privacy parameter into a formal guarantee. The EVR paradigm first estimates the privacy parameter of a mechanism, then verifies whether it meets this guarantee, and finally releases the query output based on the verification result. The core component of the EVR is privacy verification. We develop a randomized privacy verifier using Monte Carlo (MC) technique. Furthermore, we propose an MC-based DP accountant that outperforms existing DP accounting techniques in terms of accuracy and efficiency. Our empirical evaluation shows the newly proposed EVR paradigm improves the utility-privacy tradeoff for privacy-preserving machine learning.

LGAug 27, 2022
Overparameterization from Computational Constraints

Sanjam Garg, Somesh Jha, Saeed Mahloujifar et al.

Overparameterized models with millions of parameters have been hugely successful. In this work, we ask: can the need for large models be, at least in part, due to the \emph{computational} limitations of the learner? Additionally, we ask, is this situation exacerbated for \emph{robust} learning? We show that this indeed could be the case. We show learning tasks for which computationally bounded learners need \emph{significantly more} model parameters than what information-theoretic learners need. Furthermore, we show that even more model parameters could be necessary for robust learning. In particular, for computationally bounded learners, we extend the recent result of Bubeck and Sellke [NeurIPS'2021] which shows that robust models might need more parameters, to the computational regime and show that bounded learners could provably need an even larger number of parameters. Then, we address the following related question: can we hope to remedy the situation for robust computationally bounded learning by restricting \emph{adversaries} to also be computationally bounded for sake of obtaining models with fewer parameters? Here again, we show that this could be possible. Specifically, building on the work of Garg, Jha, Mahloujifar, and Mahmoody [ALT'2020], we demonstrate a learning task that can be learned efficiently and robustly against a computationally bounded attacker, while to be robust against an information-theoretic attacker requires the learner to utilize significantly more parameters.

CRSep 16, 2022
Renyi Differential Privacy of Propose-Test-Release and Applications to Private and Robust Machine Learning

Jiachen T. Wang, Saeed Mahloujifar, Shouda Wang et al.

Propose-Test-Release (PTR) is a differential privacy framework that works with local sensitivity of functions, instead of their global sensitivity. This framework is typically used for releasing robust statistics such as median or trimmed mean in a differentially private manner. While PTR is a common framework introduced over a decade ago, using it in applications such as robust SGD where we need many adaptive robust queries is challenging. This is mainly due to the lack of Renyi Differential Privacy (RDP) analysis, an essential ingredient underlying the moments accountant approach for differentially private deep learning. In this work, we generalize the standard PTR and derive the first RDP bound for it when the target function has bounded global sensitivity. We show that our RDP bound for PTR yields tighter DP guarantees than the directly analyzed $(\eps, δ)$-DP. We also derive the algorithm-specific privacy amplification bound of PTR under subsampling. We show that our bound is much tighter than the general upper bound and close to the lower bound. Our RDP bounds enable tighter privacy loss calculation for the composition of many adaptive runs of PTR. As an application of our analysis, we show that PTR and our theoretical results can be used to design differentially private variants for byzantine robust training algorithms that use robust statistics for gradients aggregation. We conduct experiments on the settings of label, feature, and gradient corruption across different datasets and architectures. We show that PTR-based private and robust training algorithm significantly improves the utility compared with the baseline.

LGFeb 4
Learning to Reason in 13 Parameters

John X. Morris, Niloofar Mireshghallah, Mark Ibrahim et al.

Recent research has shown that language models can learn to \textit{reason}, often via reinforcement learning. Some work even trains low-rank parameterizations for reasoning, but conventional LoRA cannot scale below the model dimension. We question whether even rank=1 LoRA is necessary for learning to reason and propose TinyLoRA, a method for scaling low-rank adapters to sizes as small as one parameter. Within our new parameterization, we are able to train the 8B parameter size of Qwen2.5 to 91\% accuracy on GSM8K with only 13 trained parameters in bf16 (26 total bytes). We find this trend holds in general: we are able to recover 90\% of performance improvements while training $1000x$ fewer parameters across a suite of more difficult learning-to-reason benchmarks such as AIME, AMC, and MATH500. Notably, we are only able to achieve such strong performance with RL: models trained using SFT require $100-1000x$ larger updates to reach the same performance.

LGMay 26, 2022
Circumventing Backdoor Defenses That Are Based on Latent Separability

Xiangyu Qi, Tinghao Xie, Yiming Li et al.

Recent studies revealed that deep learning is susceptible to backdoor poisoning attacks. An adversary can embed a hidden backdoor into a model to manipulate its predictions by only modifying a few training data, without controlling the training process. Currently, a tangible signature has been widely observed across a diverse set of backdoor poisoning attacks -- models trained on a poisoned dataset tend to learn separable latent representations for poison and clean samples. This latent separation is so pervasive that a family of backdoor defenses directly take it as a default assumption (dubbed latent separability assumption), based on which to identify poison samples via cluster analysis in the latent space. An intriguing question consequently follows: is the latent separation unavoidable for backdoor poisoning attacks? This question is central to understanding whether the assumption of latent separability provides a reliable foundation for defending against backdoor poisoning attacks. In this paper, we design adaptive backdoor poisoning attacks to present counter-examples against this assumption. Our methods include two key components: (1) a set of trigger-planted samples correctly labeled to their semantic classes (other than the target class) that can regularize backdoor learning; (2) asymmetric trigger planting strategies that help to boost attack success rate (ASR) as well as to diversify latent representations of poison samples. Extensive experiments on benchmark datasets verify the effectiveness of our adaptive attacks in bypassing existing latent separation based backdoor defenses. Moreover, our attacks still maintain a high attack success rate with negligible clean accuracy drop. Our studies call for defense designers to take caution when leveraging latent separation as an assumption in their defenses.

LGDec 18, 2025
Privacy Blur: Quantifying Privacy and Utility for Image Data Release

Saeed Mahloujifar, Narine Kokhlikyan, Chuan Guo et al.

Image data collected in the wild often contains private information such as faces and license plates, and responsible data release must ensure that this information stays hidden. At the same time, released data should retain its usefulness for model-training. The standard method for private information obfuscation in images is Gaussian blurring. In this work, we show that practical implementations of Gaussian blurring are reversible enough to break privacy. We then take a closer look at the privacy-utility tradeoffs offered by three other obfuscation algorithms -- pixelization, pixelization and noise addition (DP-Pix), and cropping. Privacy is evaluated by reversal and discrimination attacks, while utility by the quality of the learnt representations when the model is trained on data with obfuscated faces. We show that the most popular industry-standard method, Gaussian blur is the least private of the four -- being susceptible to reversal attacks in its practical low-precision implementations. In contrast, pixelization and pixelization plus noise addition, when used at the right level of granularity, offer both privacy and utility for a number of computer vision tasks. We make our proposed methods together with suggested parameters available in a software package called Privacy Blur.

LGDec 12, 2021Code
SparseFed: Mitigating Model Poisoning Attacks in Federated Learning with Sparsification

Ashwinee Panda, Saeed Mahloujifar, Arjun N. Bhagoji et al.

Federated learning is inherently vulnerable to model poisoning attacks because its decentralized nature allows attackers to participate with compromised devices. In model poisoning attacks, the attacker reduces the model's performance on targeted sub-tasks (e.g. classifying planes as birds) by uploading "poisoned" updates. In this report we introduce \algoname{}, a novel defense that uses global top-k update sparsification and device-level gradient clipping to mitigate model poisoning attacks. We propose a theoretical framework for analyzing the robustness of defenses against poisoning attacks, and provide robustness and convergence analysis of our algorithm. To validate its empirical efficacy we conduct an open-source evaluation at scale across multiple benchmark datasets for computer vision and federated learning.

LGMay 29, 2019Code
Empirically Measuring Concentration: Fundamental Limits on Intrinsic Robustness

Saeed Mahloujifar, Xiao Zhang, Mohammad Mahmoody et al.

Many recent works have shown that adversarial examples that fool classifiers can be found by minimally perturbing a normal input. Recent theoretical results, starting with Gilmer et al. (2018b), show that if the inputs are drawn from a concentrated metric probability space, then adversarial examples with small perturbation are inevitable. A concentrated space has the property that any subset with $Ω(1)$ (e.g., 1/100) measure, according to the imposed distribution, has small distance to almost all (e.g., 99/100) of the points in the space. It is not clear, however, whether these theoretical results apply to actual distributions such as images. This paper presents a method for empirically measuring and bounding the concentration of a concrete dataset which is proven to converge to the actual concentration. We use it to empirically estimate the intrinsic robustness to $\ell_\infty$ and $\ell_2$ perturbations of several image classification benchmarks. Code for our experiments is available at https://github.com/xiaozhanguva/Measure-Concentration.

LGJan 9, 2024
Private Fine-tuning of Large Language Models with Zeroth-order Optimization

Xinyu Tang, Ashwinee Panda, Milad Nasr et al.

Differentially private stochastic gradient descent (DP-SGD) allows models to be trained in a privacy-preserving manner, but has proven difficult to scale to the era of foundation models. We introduce DP-ZO, a private fine-tuning framework for large language models by privatizing zeroth order optimization methods. A key insight into the design of our method is that the direction of the gradient in the zeroth-order optimization we use is random and the only information from training data is the step size, i.e., a scalar. Therefore, we only need to privatize the scalar step size, which is memory-efficient. DP-ZO provides a strong privacy-utility trade-off across different tasks, and model sizes that are comparable to DP-SGD in $(\varepsilon,δ)$-DP. Notably, DP-ZO possesses significant advantages over DP-SGD in memory efficiency, and obtains higher utility in $\varepsilon$-DP when using the Laplace mechanism.

CLMay 30, 2025
How much do language models memorize?

John X. Morris, Chawin Sitawarin, Chuan Guo et al. · deepmind, meta-ai

We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from generalization. We formally separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information a model contains about the true data-generation process. When we completely eliminate generalization, we can compute the total memorization, which provides an estimate of model capacity: our measurements estimate that GPT-style models have a capacity of approximately 3.6 bits per parameter. We train language models on datasets of increasing size and observe that models memorize until their capacity fills, at which point "grokking" begins, and unintended memorization decreases as models begin to generalize. We train hundreds of transformer language models ranging from $500K$ to $1.5B$ parameters and produce a series of scaling laws relating model capacity and data size to membership inference.

LGOct 29, 2024
Auditing $f$-Differential Privacy in One Run

Saeed Mahloujifar, Luca Melis, Kamalika Chaudhuri

Empirical auditing has emerged as a means of catching some of the flaws in the implementation of privacy-preserving algorithms. Existing auditing mechanisms, however, are either computationally inefficient requiring multiple runs of the machine learning algorithms or suboptimal in calculating an empirical privacy. In this work, we present a tight and efficient auditing procedure and analysis that can effectively assess the privacy of mechanisms. Our approach is efficient; similar to the recent work of Steinke, Nasr, and Jagielski (2023), our auditing procedure leverages the randomness of examples in the input dataset and requires only a single run of the target mechanism. And it is more accurate; we provide a novel analysis that enables us to achieve tight empirical privacy estimates by using the hypothesized $f$-DP curve of the mechanism, which provides a more accurate measure of privacy than the traditional $ε,δ$ differential privacy parameters. We use our auditing procure and analysis to obtain empirical privacy, demonstrating that our auditing procedure delivers tighter privacy estimates.

CVDec 11, 2024
Unlocking Visual Secrets: Inverting Features with Diffusion Priors for Image Reconstruction

Sai Qian Zhang, Ziyun Li, Chuan Guo et al.

Inverting visual representations within deep neural networks (DNNs) presents a challenging and important problem in the field of security and privacy for deep learning. The main goal is to invert the features of an unidentified target image generated by a pre-trained DNN, aiming to reconstruct the original image. Feature inversion holds particular significance in understanding the privacy leakage inherent in contemporary split DNN execution techniques, as well as in various applications based on the extracted DNN features. In this paper, we explore the use of diffusion models, a promising technique for image synthesis, to enhance feature inversion quality. We also investigate the potential of incorporating alternative forms of prior knowledge, such as textual prompts and cross-frame temporal correlations, to further improve the quality of inverted features. Our findings reveal that diffusion models can effectively leverage hidden information from the DNN features, resulting in superior reconstruction performance compared to previous methods. This research offers valuable insights into how diffusion models can enhance privacy and security within applications that are reliant on DNN features.

CRFeb 24, 2025
Detecting Benchmark Contamination Through Watermarking

Tom Sander, Pierre Fernandez, Saeed Mahloujifar et al.

Benchmark contamination poses a significant challenge to the reliability of Large Language Models (LLMs) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark utility. During evaluation, we can detect ``radioactivity'', \ie traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection when models are contaminated enough to enhance performance, \eg $p$-val $=10^{-3}$ for +5$\%$ on ARC-Easy.

LGApr 3, 2024
Guarantees of confidentiality via Hammersley-Chapman-Robbins bounds

Kamalika Chaudhuri, Chuan Guo, Laurens van der Maaten et al.

Protecting privacy during inference with deep neural networks is possible by adding noise to the activations in the last layers prior to the final classifiers or other task-specific layers. The activations in such layers are known as "features" (or, less commonly, as "embeddings" or "feature embeddings"). The added noise helps prevent reconstruction of the inputs from the noisy features. Lower bounding the variance of every possible unbiased estimator of the inputs quantifies the confidentiality arising from such added noise. Convenient, computationally tractable bounds are available from classic inequalities of Hammersley and of Chapman and Robbins -- the HCR bounds. Numerical experiments indicate that the HCR bounds are on the precipice of being effectual for small neural nets with the data sets, "MNIST" and "CIFAR-10," which contain 10 classes each for image classification. The HCR bounds appear to be insufficient on their own to guarantee confidentiality of the inputs to inference with standard deep neural nets, "ResNet-18" and "Swin-T," pre-trained on the data set, "ImageNet-1000," which contains 1000 classes. Supplementing the addition of noise to features with other methods for providing confidentiality may be warranted in the case of ImageNet. In all cases, the results reported here limit consideration to amounts of added noise that incur little degradation in the accuracy of classification from the noisy features. Thus, the added noise enhances confidentiality without much reduction in the accuracy on the task of image classification.

CRMar 7, 2024
Privacy Amplification for the Gaussian Mechanism via Bounded Support

Shengyuan Hu, Saeed Mahloujifar, Virginia Smith et al.

Data-dependent privacy accounting frameworks such as per-instance differential privacy (pDP) and Fisher information loss (FIL) confer fine-grained privacy guarantees for individuals in a fixed training dataset. These guarantees can be desirable compared to vanilla DP in real world settings as they tightly upper-bound the privacy leakage for a $\textit{specific}$ individual in an $\textit{actual}$ dataset, rather than considering worst-case datasets. While these frameworks are beginning to gain popularity, to date, there is a lack of private mechanisms that can fully leverage advantages of data-dependent accounting. To bridge this gap, we propose simple modifications of the Gaussian mechanism with bounded support, showing that they amplify privacy guarantees under data-dependent accounting. Experiments on model training with DP-SGD show that using bounded support Gaussian mechanisms can provide a reduction of the pDP bound $ε$ by as much as 30% without negative effects on model utility.

LGNov 18, 2025
Observational Auditing of Label Privacy

Iden Kalemaj, Luca Melis, Maxime Boucher et al.

Differential privacy (DP) auditing is essential for evaluating privacy guarantees in machine learning systems. Existing auditing methods, however, pose a significant challenge for large-scale systems since they require modifying the training dataset -- for instance, by injecting out-of-distribution canaries or removing samples from training. Such interventions on the training data pipeline are resource-intensive and involve considerable engineering overhead. We introduce a novel observational auditing framework that leverages the inherent randomness of data distributions, enabling privacy evaluation without altering the original dataset. Our approach extends privacy auditing beyond traditional membership inference to protected attributes, with labels as a special case, addressing a key gap in existing techniques. We provide theoretical foundations for our method and perform experiments on Criteo and CIFAR-10 datasets that demonstrate its effectiveness in auditing label privacy guarantees. This work opens new avenues for practical privacy auditing in large-scale production environments.

LGOct 13, 2025
Z0-Inf: Zeroth Order Approximation for Data Influence

Narine Kokhlikyan, Kamalika Chaudhuri, Saeed Mahloujifar

A critical aspect of analyzing and improving modern machine learning systems lies in understanding how individual training examples influence a model's predictive behavior. Estimating this influence enables critical applications, including data selection and model debugging; in particular, self-influence, which quantifies the influence of a training point on itself, has found many uses in data quality assessment and outlier detection. Existing methods for measuring data influence, however, are often impractical for large models due to low accuracy or prohibitive computational costs: most approaches either provide poor approximations or rely on gradients and inverse-Hessian computations that remain challenging to scale. In this work, we introduce a highly efficient zeroth-order approximation for estimating the influence of training data that requires only a fraction of the time and memory footprint of prior methods. Notably, our method relies solely on loss values of intermediate checkpoints on the training and test data, along with the checkpoints themselves, making it broadly applicable even when the loss function of interest is non-differentiable. Beyond its computational efficiency, our approach achieves superior accuracy in estimating self-influence and comparable or improved accuracy in estimating train-test influence for fine-tuned large language models, enabling scalable and practical analysis of how training data shapes model behavior.

CRJun 24, 2025
Machine Learning with Privacy for Protected Attributes

Saeed Mahloujifar, Chuan Guo, G. Edward Suh et al.

Differential privacy (DP) has become the standard for private data analysis. Certain machine learning applications only require privacy protection for specific protected attributes. Using naive variants of differential privacy in such use cases can result in unnecessary degradation of utility. In this work, we refine the definition of DP to create a more general and flexible framework that we call feature differential privacy (FDP). Our definition is simulation-based and allows for both addition/removal and replacement variants of privacy, and can handle arbitrary and adaptive separation of protected and non-protected features. We prove the properties of FDP, such as adaptive composition, and demonstrate its implications for limiting attribute inference attacks. We also propose a modification of the standard DP-SGD algorithm that satisfies FDP while leveraging desirable properties such as amplification via sub-sampling. We apply our framework to various machine learning tasks and show that it can significantly improve the utility of DP-trained models when public features are available. For example, we train diffusion models on the AFHQ dataset of animal faces and observe a drastic improvement in FID compared to DP, from 286.7 to 101.9 at $ε=8$, assuming that the blurred version of a training image is available as a public feature. Overall, our work provides a new approach to private data analysis that can help reduce the utility cost of DP while still providing strong privacy guarantees.

CVFeb 3, 2022
ObjectSeeker: Certifiably Robust Object Detection against Patch Hiding Attacks via Patch-agnostic Masking

Chong Xiang, Alexander Valtchanov, Saeed Mahloujifar et al.

Object detectors, which are widely deployed in security-critical systems such as autonomous vehicles, have been found vulnerable to patch hiding attacks. An attacker can use a single physically-realizable adversarial patch to make the object detector miss the detection of victim objects and undermine the functionality of object detection applications. In this paper, we propose ObjectSeeker for certifiably robust object detection against patch hiding attacks. The key insight in ObjectSeeker is patch-agnostic masking: we aim to mask out the entire adversarial patch without knowing the shape, size, and location of the patch. This masking operation neutralizes the adversarial effect and allows any vanilla object detector to safely detect objects on the masked images. Remarkably, we can evaluate ObjectSeeker's robustness in a certifiable manner: we develop a certification procedure to formally determine if ObjectSeeker can detect certain objects against any white-box adaptive attack within the threat model, achieving certifiable robustness. Our experiments demonstrate a significant (~10%-40% absolute and ~2-6x relative) improvement in certifiable robustness over the prior work, as well as high clean performance (~1% drop compared with undefended models).

CROct 15, 2021
Mitigating Membership Inference Attacks by Self-Distillation Through a Novel Ensemble Architecture

Xinyu Tang, Saeed Mahloujifar, Liwei Song et al.

Membership inference attacks are a key measure to evaluate privacy leakage in machine learning (ML) models. These attacks aim to distinguish training members from non-members by exploiting differential behavior of the models on member and non-member inputs. The goal of this work is to train ML models that have high membership privacy while largely preserving their utility; we therefore aim for an empirical membership privacy guarantee as opposed to the provable privacy guarantees provided by techniques like differential privacy, as such techniques are shown to deteriorate model utility. Specifically, we propose a new framework to train privacy-preserving models that induces similar behavior on member and non-member inputs to mitigate membership inference attacks. Our framework, called SELENA, has two major components. The first component and the core of our defense is a novel ensemble architecture for training. This architecture, which we call Split-AI, splits the training data into random subsets, and trains a model on each subset of the data. We use an adaptive inference strategy at test time: our ensemble architecture aggregates the outputs of only those models that did not contain the input sample in their training data. We prove that our Split-AI architecture defends against a large family of membership inference attacks, however, it is susceptible to new adaptive attacks. Therefore, we use a second component in our framework called Self-Distillation to protect against such stronger attacks. The Self-Distillation component (self-)distills the training dataset through our Split-AI ensemble, without using any external public datasets. Through extensive experiments on major benchmark datasets we show that SELENA presents a superior trade-off between membership privacy and utility compared to the state of the art.

LGOct 11, 2021
Parameterizing Activation Functions for Adversarial Robustness

Sihui Dai, Saeed Mahloujifar, Prateek Mittal

Deep neural networks are known to be vulnerable to adversarially perturbed inputs. A commonly used defense is adversarial training, whose performance is influenced by model capacity. While previous works have studied the impact of varying model width and depth on robustness, the impact of increasing capacity by using learnable parametric activation functions (PAFs) has not been studied. We study how using learnable PAFs can improve robustness in conjunction with adversarial training. We first ask the question: how should we incorporate parameters into activation functions to improve robustness? To address this, we analyze the direct impact of activation shape on robustness through PAFs and observe that activation shapes with positive outputs on negative inputs and with high finite curvature can increase robustness. We combine these properties to create a new PAF, which we call Parametric Shifted Sigmoidal Linear Unit (PSSiLU). We then combine PAFs (including PReLU, PSoftplus and PSSiLU) with adversarial training and analyze robust performance. We find that PAFs optimize towards activation shape properties found to directly affect robustness. Additionally, we find that while introducing only 1-2 learnable parameters into the network, smooth PAFs can significantly increase robustness over ReLU. For instance, when trained on CIFAR-10 with additional synthetic data, PSSiLU improves robust accuracy by 4.54% over ReLU on ResNet-18 and 2.69% over ReLU on WRN-28-10 in the $\ell_{\infty}$ threat model while adding only 2 additional parameters into the network architecture. The PSSiLU WRN-28-10 model achieves 61.96% AutoAttack accuracy, improving over the state-of-the-art robust accuracy on RobustBench (Croce et al., 2020).

CVAug 20, 2021
PatchCleanser: Certifiably Robust Defense against Adversarial Patches for Any Image Classifier

Chong Xiang, Saeed Mahloujifar, Prateek Mittal

The adversarial patch attack against image classification models aims to inject adversarially crafted pixels within a restricted image region (i.e., a patch) for inducing model misclassification. This attack can be realized in the physical world by printing and attaching the patch to the victim object; thus, it imposes a real-world threat to computer vision systems. To counter this threat, we design PatchCleanser as a certifiably robust defense against adversarial patches. In PatchCleanser, we perform two rounds of pixel masking on the input image to neutralize the effect of the adversarial patch. This image-space operation makes PatchCleanser compatible with any state-of-the-art image classifier for achieving high accuracy. Furthermore, we can prove that PatchCleanser will always predict the correct class labels on certain images against any adaptive white-box attacker within our threat model, achieving certified robustness. We extensively evaluate PatchCleanser on the ImageNet, ImageNette, CIFAR-10, CIFAR-100, SVHN, and Flowers-102 datasets and demonstrate that our defense achieves similar clean accuracy as state-of-the-art classification models and also significantly improves certified robustness from prior works. Remarkably, PatchCleanser achieves 83.9% top-1 clean accuracy and 62.1% top-1 certified robust accuracy against a 2%-pixel square patch anywhere on the image for the 1000-class ImageNet dataset.

CRAug 16, 2021
NeuraCrypt is not private

Nicholas Carlini, Sanjam Garg, Somesh Jha et al.

NeuraCrypt (Yara et al. arXiv 2021) is an algorithm that converts a sensitive dataset to an encoded dataset so that (1) it is still possible to train machine learning models on the encoded data, but (2) an adversary who has access only to the encoded dataset can not learn much about the original sensitive dataset. We break NeuraCrypt privacy claims, by perfectly solving the authors' public challenge, and by showing that NeuraCrypt does not satisfy the formal privacy definitions posed in the original paper. Our attack consists of a series of boosting steps that, coupled with various design flaws, turns a 1% attack advantage into a 100% complete break of the scheme.

CLJun 21, 2021
Membership Inference on Word Embedding and Beyond

Saeed Mahloujifar, Huseyin A. Inan, Melissa Chase et al.

In the text processing context, most ML models are built on word embeddings. These embeddings are themselves trained on some datasets, potentially containing sensitive data. In some cases this training is done independently, in other cases, it occurs as part of training a larger, task-specific model. In either case, it is of interest to consider membership inference attacks based on the embedding layer as a way of understanding sensitive information leakage. But, somewhat surprisingly, membership inference attacks on word embeddings and their effect in other natural language processing (NLP) tasks that use these embeddings, have remained relatively unexplored. In this work, we show that word embeddings are vulnerable to black-box membership inference attacks under realistic assumptions. Furthermore, we show that this leakage persists through two other major NLP applications: classification and text-generation, even when the embedding layer is not exposed to the attacker. We show that our MI attack achieves high attack accuracy against a classifier model and an LSTM-based language model. Indeed, our attack is a cheaper membership inference attack on text-generative models, which does not require the knowledge of the target model or any expensive training of text-generative models as shadow models.

LGApr 19, 2021
Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness?

Vikash Sehwag, Saeed Mahloujifar, Tinashe Handina et al.

While additional training data improves the robustness of deep neural networks against adversarial examples, it presents the challenge of curating a large number of specific real-world samples. We circumvent this challenge by using additional data from proxy distributions learned by advanced generative models. We first seek to formally understand the transfer of robustness from classifiers trained on proxy distributions to the real data distribution. We prove that the difference between the robustness of a classifier on the two distributions is upper bounded by the conditional Wasserstein distance between them. Next we use proxy distributions to significantly improve the performance of adversarial training on five different datasets. For example, we improve robust accuracy by up to 7.5% and 6.7% in $\ell_{\infty}$ and $\ell_2$ threat model over baselines that are not using proxy distributions on the CIFAR-10 dataset. We also improve certified robust accuracy by 7.6% on the CIFAR-10 dataset. We further demonstrate that different generative models bring a disparate improvement in the performance in robust training. We propose a robust discrimination approach to characterize the impact of individual generative models and further provide a deeper understanding of why current state-of-the-art in diffusion-based generative models are a better choice for proxy distribution than generative adversarial networks.

LGJan 26, 2021
Property Inference From Poisoning

Melissa Chase, Esha Ghosh, Saeed Mahloujifar

Property inference attacks consider an adversary who has access to the trained model and tries to extract some global statistics of the training data. In this work, we study property inference in scenarios where the adversary can maliciously control part of the training data (poisoning data) with the goal of increasing the leakage. Previous work on poisoning attacks focused on trying to decrease the accuracy of models either on the whole population or on specific sub-populations or instances. Here, for the first time, we study poisoning attacks where the goal of the adversary is to increase the information leakage of the model. Our findings suggest that poisoning attacks can boost the information leakage significantly and should be considered as a stronger threat model in sensitive applications where some of the data sources may be malicious. We describe our \emph{property inference poisoning attack} that allows the adversary to learn the prevalence in the training data of any property it chooses. We theoretically prove that our attack can always succeed as long as the learning algorithm used has good generalization properties. We then verify the effectiveness of our attack by experimentally evaluating it on two datasets: a Census dataset and the Enron email dataset. We were able to achieve above $90\%$ attack accuracy with $9-10\%$ poisoning in all of our experiments.

CRNov 10, 2020
Is Private Learning Possible with Instance Encoding?

Nicholas Carlini, Samuel Deng, Sanjam Garg et al.

A private machine learning algorithm hides as much as possible about its training data while still preserving accuracy. In this work, we study whether a non-private learning algorithm can be made private by relying on an instance-encoding mechanism that modifies the training inputs before feeding them to a normal learner. We formalize both the notion of instance encoding and its privacy by providing two attack models. We first prove impossibility results for achieving a (stronger) model. Next, we demonstrate practical attacks in the second (weaker) attack model on InstaHide, a recent proposal by Huang, Song, Li and Arora [ICML'20] that aims to use instance encoding for privacy.

LGJun 30, 2020
Model-Targeted Poisoning Attacks with Provable Convergence

Fnu Suya, Saeed Mahloujifar, Anshuman Suri et al.

In a poisoning attack, an adversary with control over a small fraction of the training data attempts to select that data in a way that induces a corrupted model that misbehaves in favor of the adversary. We consider poisoning attacks against convex machine learning models and propose an efficient poisoning attack designed to induce a specified model. Unlike previous model-targeted poisoning attacks, our attack comes with provable convergence to {\it any} attainable target classifier. The distance from the induced classifier to the target classifier is inversely proportional to the square root of the number of poisoning points. We also provide a lower bound on the minimum number of poisoning points needed to achieve a given target classifier. Our method uses online convex optimization, so finds poisoning points incrementally. This provides more flexibility than previous attacks which require a priori assumption about the number of poisoning points. Our attack is the first model-targeted poisoning attack that provides provable convergence for convex models, and in our experiments, it either exceeds or matches state-of-the-art attacks in terms of attack success rate and distance to the target model.

LGMar 26, 2020
A Separation Result Between Data-oblivious and Data-aware Poisoning Attacks

Samuel Deng, Sanjam Garg, Somesh Jha et al.

Poisoning attacks have emerged as a significant security threat to machine learning algorithms. It has been demonstrated that adversaries who make small changes to the training set, such as adding specially crafted data points, can hurt the performance of the output model. Some of the stronger poisoning attacks require the full knowledge of the training data. This leaves open the possibility of achieving the same attack results using poisoning attacks that do not have the full knowledge of the clean training set. In this work, we initiate a theoretical study of the problem above. Specifically, for the case of feature selection with LASSO, we show that full-information adversaries (that craft poisoning examples based on the rest of the training data) are provably stronger than the optimal attacker that is oblivious to the training set yet has access to the distribution of the data. Our separation result shows that the two setting of data-aware and data-oblivious are fundamentally different and we cannot hope to always achieve the same attack or defense results in these scenarios.

DSJul 11, 2019
Computational Concentration of Measure: Optimal Bounds, Reductions, and More

Omid Etesami, Saeed Mahloujifar, Mohammad Mahmoody

Product measures of dimension $n$ are known to be concentrated in Hamming distance: for any set $S$ in the product space of probability $ε$, a random point in the space, with probability $1-δ$, has a neighbor in $S$ that is different from the original point in only $O(\sqrt{n\ln(1/(εδ))})$ coordinates. We obtain the tight computational version of this result, showing how given a random point and access to an $S$-membership oracle, we can find such a close point in polynomial time. This resolves an open question of [Mahloujifar and Mahmoody, ALT 2019]. As corollaries, we obtain polynomial-time poisoning and (in certain settings) evasion attacks against learning algorithms when the original vulnerabilities have any cryptographically non-negligible probability. We call our algorithm MUCIO ("MUltiplicative Conditional Influence Optimizer") since proceeding through the coordinates, it decides to change each coordinate of the given point based on a multiplicative version of the influence of that coordinate, where influence is computed conditioned on previously updated coordinates. We also define a new notion of algorithmic reduction between computational concentration of measure in different metric probability spaces. As an application, we get computational concentration of measure for high-dimensional Gaussian distributions under the $\ell_1$ metric. We prove several extensions to the results above: (1) Our computational concentration result is also true when the Hamming distance is weighted. (2) We obtain an algorithmic version of concentration around mean, more specifically, McDiarmid's inequality. (3) Our result generalizes to discrete random processes, and this leads to new tampering algorithms for collective coin tossing protocols. (4) We prove exponential lower bounds on the average running time of non-adaptive query algorithms.

LGJun 13, 2019
Lower Bounds for Adversarially Robust PAC Learning

Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody

In this work, we initiate a formal study of probably approximately correct (PAC) learning under evasion attacks, where the adversary's goal is to \emph{misclassify} the adversarially perturbed sample point $\widetilde{x}$, i.e., $h(\widetilde{x})\neq c(\widetilde{x})$, where $c$ is the ground truth concept and $h$ is the learned hypothesis. Previous work on PAC learning of adversarial examples have all modeled adversarial examples as corrupted inputs in which the goal of the adversary is to achieve $h(\widetilde{x}) \neq c(x)$, where $x$ is the original untampered instance. These two definitions of adversarial risk coincide for many natural distributions, such as images, but are incomparable in general. We first prove that for many theoretically natural input spaces of high dimension $n$ (e.g., isotropic Gaussian in dimension $n$ under $\ell_2$ perturbations), if the adversary is allowed to apply up to a sublinear $o(||x||)$ amount of perturbations on the test instances, PAC learning requires sample complexity that is exponential in $n$. This is in contrast with results proved using the corrupted-input framework, in which the sample complexity of robust learning is only polynomially more. We then formalize hybrid attacks in which the evasion attack is preceded by a poisoning attack. This is perhaps reminiscent of "trapdoor attacks" in which a poisoning phase is involved as well, but the evasion phase here uses the error-region definition of risk that aims at misclassifying the perturbed instances. In this case, we show PAC learning is sometimes impossible all together, even when it is possible without the attack (e.g., due to the bounded VC dimension).

LGMay 28, 2019
Adversarially Robust Learning Could Leverage Computational Hardness

Sanjam Garg, Somesh Jha, Saeed Mahloujifar et al.

Over recent years, devising classification algorithms that are robust to adversarial perturbations has emerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible to small imperceptible changes over test instances. However, the line of work in provable robustness, so far, has been focused on information-theoretic robustness, ruling out even the existence of any adversarial examples. In this work, we study whether there is a hope to benefit from algorithmic nature of an attacker that searches for adversarial examples, and ask whether there is any learning task for which it is possible to design classifiers that are only robust against polynomial-time adversaries. Indeed, numerous cryptographic tasks can only be secure against computationally bounded adversaries, and are indeed impossible for computationally unbounded attackers. Thus, it is natural to ask if the same strategy could help robust learning. We show that computational limitation of attackers can indeed be useful in robust learning by demonstrating the possibility of a classifier for some learning task for which computational and information theoretic adversaries of bounded perturbations have very different power. Namely, while computationally unbounded adversaries can attack successfully and find adversarial examples with small perturbation, polynomial time adversaries are unable to do so unless they can break standard cryptographic hardness assumptions. Our results, therefore, indicate that perhaps a similar approach to cryptography (relying on computational hardness) holds promise for achieving computationally robust machine learning. On the reverse directions, we also show that the existence of such learning task in which computational robustness beats information theoretic robustness requires computational hardness by implying (average-case) hardness of NP.

LGOct 29, 2018
Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution

Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody

We study adversarial perturbations when the instances are uniformly distributed over $\{0,1\}^n$. We study both "inherent" bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their goals, we identify one of them as the one guaranteeing misclassification by pushing the instances to the error region. We then study some classic algorithms for learning monotone conjunctions and compare their adversarial risk and robustness under different definitions by attacking the hypotheses using instances drawn from the uniform distribution. We observe that sometimes these definitions lead to significantly different bounds. Thus, this study advocates for the use of the error-region definition, even though other definitions, in other contexts, may coincide with the error-region definition. Using the error-region definition of adversarial perturbations, we then study inherent bounds on risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over $\{0,1\}^n$. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error $0.01$, there always exists an adversarial perturbation that changes $O(\sqrt{n})$ bits of the instances to increase the risk to $0.5$, making classifier's decisions meaningless. Furthermore, by also using the central limit theorem we show that when $n\to \infty$, at most $c \cdot \sqrt{n}$ bits of perturbations, for a universal constant $c< 1.17$, suffice for increasing the risk to $0.5$, and the same $c \cdot \sqrt{n} $ bits of perturbations on average suffice to increase the risk to $1$, hence bounding the robustness by $c \cdot \sqrt{n}$.

LGOct 2, 2018
Can Adversarially Robust Learning Leverage Computational Hardness?

Saeed Mahloujifar, Mohammad Mahmoody

Making learners robust to adversarial perturbation at test time (i.e., evasion attacks) or training time (i.e., poisoning attacks) has emerged as a challenging task. It is known that for some natural settings, sublinear perturbations in the training phase or the testing phase can drastically decrease the quality of the predictions. These negative results, however, are information theoretic and only prove the existence of such successful adversarial perturbations. A natural question for these settings is whether or not we can make classifiers computationally robust to polynomial-time attacks. In this work, we prove strong barriers against achieving such envisioned computational robustness both for evasion and poisoning attacks. In particular, we show that if the test instances come from a product distribution (e.g., uniform over $\{0,1\}^n$ or $[0,1]^n$, or isotropic $n$-variate Gaussian) and that there is an initial constant error, then there exists a polynomial-time attack that finds adversarial examples of Hamming distance $O(\sqrt n)$. For poisoning attacks, we prove that for any learning algorithm with sample complexity $m$ and any efficiently computable "predicate" defining some "bad" property $B$ for the produced hypothesis (e.g., failing on a particular test) that happens with an initial constant probability, there exist polynomial-time online poisoning attacks that tamper with $O (\sqrt m)$ many examples, replace them with other correctly labeled examples, and increases the probability of the bad event $B$ to $\approx 1$. Both of our poisoning and evasion attacks are black-box in how they access their corresponding components of the system (i.e., the hypothesis, the concept, and the learning algorithm) and make no further assumptions about the classifier or the learning algorithm producing the classifier.

LGSep 10, 2018
Universal Multi-Party Poisoning Attacks

Saeed Mahloujifar, Mohammad Mahmoody, Ameer Mohammed

In this work, we demonstrate universal multi-party poisoning attacks that adapt and apply to any multi-party learning process with arbitrary interaction pattern between the parties. More generally, we introduce and study $(k,p)$-poisoning attacks in which an adversary controls $k\in[m]$ of the parties, and for each corrupted party $P_i$, the adversary submits some poisoned data $\mathcal{T}'_i$ on behalf of $P_i$ that is still ``$(1-p)$-close'' to the correct data $\mathcal{T}_i$ (e.g., $1-p$ fraction of $\mathcal{T}'_i$ is still honestly generated). We prove that for any ``bad'' property $B$ of the final trained hypothesis $h$ (e.g., $h$ failing on a particular test example or having ``large'' risk) that has an arbitrarily small constant probability of happening without the attack, there always is a $(k,p)$-poisoning attack that increases the probability of $B$ from $μ$ to by $μ^{1-p \cdot k/m} = μ+ Ω(p \cdot k/m)$. Our attack only uses clean labels, and it is online. More generally, we prove that for any bounded function $f(x_1,\dots,x_n) \in [0,1]$ defined over an $n$-step random process $\mathbf{X} = (x_1,\dots,x_n)$, an adversary who can override each of the $n$ blocks with even dependent probability $p$ can increase the expected output by at least $Ω(p \cdot \mathrm{Var}[f(\mathbf{x})])$.

LGSep 9, 2018
The Curse of Concentration in Robust Learning: Evasion and Poisoning Attacks from Concentration of Measure

Saeed Mahloujifar, Dimitrios I. Diochnos, Mohammad Mahmoody

Many modern machine learning classifiers are shown to be vulnerable to adversarial perturbations of the instances. Despite a massive amount of work focusing on making classifiers robust, the task seems quite challenging. In this work, through a theoretical study, we investigate the adversarial risk and robustness of classifiers and draw a connection to the well-known phenomenon of concentration of measure in metric measure spaces. We show that if the metric probability space of the test instance is concentrated, any classifier with some initial constant error is inherently vulnerable to adversarial perturbations. One class of concentrated metric probability spaces are the so-called Levy families that include many natural distributions. In this special case, our attacks only need to perturb the test instance by at most $O(\sqrt n)$ to make it misclassified, where $n$ is the data dimension. Using our general result about Levy instance spaces, we first recover as special case some of the previously proved results about the existence of adversarial examples. However, many more Levy families are known (e.g., product distribution under the Hamming distance) for which we immediately obtain new attacks that find adversarial examples of distance $O(\sqrt n)$. Finally, we show that concentration of measure for product spaces implies the existence of forms of "poisoning" attacks in which the adversary tampers with the training data with the goal of degrading the classifier. In particular, we show that for any learning algorithm that uses $m$ training examples, there is an adversary who can increase the probability of any "bad property" (e.g., failing on a particular test instance) that initially happens with non-negligible probability to $\approx 1$ by substituting only $\tilde{O}(\sqrt m)$ of the examples with other (still correctly labeled) examples.

LGNov 10, 2017
Learning under $p$-Tampering Attacks

Saeed Mahloujifar, Dimitrios I. Diochnos, Mohammad Mahmoody

Recently, Mahloujifar and Mahmoody (TCC'17) studied attacks against learning algorithms using a special case of Valiant's malicious noise, called $p$-tampering, in which the adversary gets to change any training example with independent probability $p$ but is limited to only choose malicious examples with correct labels. They obtained $p$-tampering attacks that increase the error probability in the so called targeted poisoning model in which the adversary's goal is to increase the loss of the trained hypothesis over a particular test example. At the heart of their attack was an efficient algorithm to bias the expected value of any bounded real-output function through $p$-tampering. In this work, we present new biasing attacks for increasing the expected value of bounded real-valued functions. Our improved biasing attacks, directly imply improved $p$-tampering attacks against learners in the targeted poisoning model. As a bonus, our attacks come with considerably simpler analysis. We also study the possibility of PAC learning under $p$-tampering attacks in the non-targeted (aka indiscriminate) setting where the adversary's goal is to increase the risk of the generated hypothesis (for a random test example). We show that PAC learning is possible under $p$-tampering poisoning attacks essentially whenever it is possible in the realizable setting without the attacks. We further show that PAC learning under "correct-label" adversarial noise is not possible in general, if the adversary could choose the (still limited to only $p$ fraction of) tampered examples that she substitutes with adversarially chosen ones. Our formal model for such "bounded-budget" tampering attackers is inspired by the notions of (strong) adaptive corruption in secure multi-party computation.