LGMay 4, 2022
Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data PoisoningAntonio Emanuele Cinà, Kathrin Grosse, Ambra Demontis et al.
The success of machine learning is fueled by the increasing availability of computing power and large training datasets. The training data is used to learn new models or update existing ones, assuming that it is sufficiently representative of the data that will be encountered at test time. This assumption is challenged by the threat of poisoning, an attack that manipulates the training data to compromise the model's performance at test time. Although poisoning has been acknowledged as a relevant threat in industry applications, and a variety of different attacks and defenses have been proposed so far, a complete systematization and critical review of the field is still missing. In this survey, we provide a comprehensive systematization of poisoning attacks and defenses in machine learning, reviewing more than 100 papers published in the field in the last 15 years. We start by categorizing the current threat models and attacks, and then organize existing defenses accordingly. While we focus mostly on computer-vision applications, we argue that our systematization also encompasses state-of-the-art attacks and defenses for other data modalities. Finally, we discuss existing resources for research in poisoning, and shed light on the current limitations and open research questions in this research field.
CRApr 12, 2022
Machine Learning Security against Data Poisoning: Are We There Yet?Antonio Emanuele Cinà, Kathrin Grosse, Ambra Demontis et al.
The recent success of machine learning (ML) has been fueled by the increasing availability of computing power and large amounts of data in many different applications. However, the trustworthiness of the resulting models can be compromised when such data is maliciously manipulated to mislead the learning process. In this article, we first review poisoning attacks that compromise the training data used to learn ML models, including attacks that aim to reduce the overall performance, manipulate the predictions on specific test samples, and even implant backdoors in the model. We then discuss how to mitigate these attacks using basic security principles, or by deploying ML-oriented defensive mechanisms. We conclude our article by formulating some relevant open challenges which are hindering the development of testing methods and benchmarks suitable for assessing and improving the trustworthiness of ML models against data poisoning attacks
LGJul 11, 2022
Machine Learning Security in Industry: A Quantitative SurveyKathrin Grosse, Lukas Bieringer, Tarek Richard Besold et al.
Despite the large body of academic work on machine learning security, little is known about the occurrence of attacks on machine learning systems in the wild. In this paper, we report on a quantitative study with 139 industrial practitioners. We analyze attack occurrence and concern and evaluate statistical hypotheses on factors influencing threat perception and exposure. Our results shed light on real-world attacks on deployed machine learning. On the organizational level, while we find no predictors for threat exposure in our sample, the amount of implement defenses depends on exposure to threats or expected likelihood to become a target. We also provide a detailed analysis of practitioners' replies on the relevance of individual machine learning attacks, unveiling complex concerns like unreliable decision making, business information leakage, and bias introduction into models. Finally, we find that on the individual level, prior knowledge about machine learning security influences threat perception. Our work paves the way for more research about adversarial machine learning in practice, but yields also insights for regulation and auditing.
LGDec 12, 2022
Security of Deep Reinforcement Learning for Autonomous Driving: A SurveyAmbra Demontis, Srishti Gupta, Maura Pintor et al.
Reinforcement learning (RL) enables agents to learn optimal behaviors through interaction with their environment and has been increasingly deployed in safety-critical applications, including autonomous driving. Despite its promise, RL is susceptible to attacks designed either to compromise policy learning or to induce erroneous decisions by trained agents. Although the literature on RL security has grown rapidly and several surveys exist, existing categorizations often fall short in guiding the selection of appropriate defenses for specific systems. In this work, we present a comprehensive survey of 86 recent studies on RL security, addressing these limitations by systematically categorizing attacks and defenses according to defined threat models and single- versus multi-agent settings. Furthermore, we examine the relevance and applicability of state-of-the-art attacks and defense mechanisms within the context of autonomous driving, providing insights to inform the design of robust RL systems.
CRNov 16, 2023
Towards more Practical Threat Models in Artificial Intelligence SecurityKathrin Grosse, Lukas Bieringer, Tarek Richard Besold et al.
Recent works have identified a gap between research and practice in artificial intelligence security: threats studied in academia do not always reflect the practical use and security risks of AI. For example, while models are often studied in isolation, they form part of larger ML pipelines in practice. Recent works also brought forward that adversarial manipulations introduced by academic attacks are impractical. We take a first step towards describing the full extent of this disparity. To this end, we revisit the threat models of the six most studied attacks in AI security research and match them to AI usage in practice via a survey with 271 industrial practitioners. On the one hand, we find that all existing threat models are indeed applicable. On the other hand, there are significant mismatches: research is often too generous with the attacker, assuming access to information not frequently available in real-world settings. Our paper is thus a call for action to study more practical threat models in artificial intelligence security.
59.7CRMay 5
Position: Mind the Gap-AI Security and the Limits of Current Reporting StandardsLukas Bieringer, Sean McGregor, Nicole Nichols et al.
AI systems face a growing number of AI security threats that are increasingly exploited in the real world. Hence, shared AI incident reporting practices are emerging in industry as best practice and as mandated by regulatory requirements. Although non-AI cybersecurity and non-security AI reporting have progressed as industrial and policy norms, existing collections of practices do not meet the specific requirements posed by AI security reporting. we argue that established processes are not well aligned with AI security reporting due to fundamental shortcomings for the distinctive characteristics of AI systems. Some of these shortcomings are immediately addressable, while others remain unresolved technically or within social systems, like the treatment of IP or the ownership of a vulnerability. Based on this position, we examine the limitations of current AI security incident reporting proposals. We conclude that the advent of AI agents will further reinforce the need to advance specialized AI security incident reporting.
LGJun 10, 2025
Design Patterns for Securing LLM Agents against Prompt InjectionsLuca Beurer-Kellner, Beat Buesser, Ana-Maria Creţu et al. · eth-zurich
As AI agents powered by Large Language Models (LLMs) become increasingly versatile and capable of addressing a broad spectrum of tasks, ensuring their security has become a critical challenge. Among the most pressing threats are prompt injection attacks, which exploit the agent's resilience on natural language inputs -- an especially dangerous threat when agents are granted tool access or handle sensitive information. In this work, we propose a set of principled design patterns for building AI agents with provable resistance to prompt injection. We systematically analyze these patterns, discuss their trade-offs in terms of utility and security, and illustrate their real-world applicability through a series of case studies.
CYFeb 21, 2024
Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairnessDavid Fernández Llorca, Ronan Hamon, Henrik Junklewitz et al.
This study explores the complexities of integrating Artificial Intelligence (AI) into Autonomous Vehicles (AVs), examining the challenges introduced by AI components and the impact on testing procedures, focusing on some of the essential requirements for trustworthy AI. Topics addressed include the role of AI at various operational layers of AVs, the implications of the EU's AI Act on AVs, and the need for new testing methodologies for Advanced Driver Assistance Systems (ADAS) and Automated Driving Systems (ADS). The study also provides a detailed analysis on the importance of cybersecurity audits, the need for explainability in AI decision-making processes and protocols for assessing the robustness and ethical behaviour of predictive systems in AVs. The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology, highlighting the need for multidisciplinary expertise.
LGDec 21, 2023
Manipulating Trajectory Prediction with BackdoorsKaouther Messaoud, Kathrin Grosse, Mickael Chen et al.
Autonomous vehicles ought to predict the surrounding agents' trajectories to allow safe maneuvers in uncertain and complex traffic situations. As companies increasingly apply trajectory prediction in the real world, security becomes a relevant concern. In this paper, we focus on backdoors - a security threat acknowledged in other fields but so far overlooked for trajectory prediction. To this end, we describe and investigate four triggers that could affect trajectory prediction. We then show that these triggers (for example, a braking vehicle), when correlated with a desired output (for example, a curve) during training, cause the desired output of a state-of-the-art trajectory prediction model. In other words, the model has good benign performance but is vulnerable to backdoors. This is the case even if the trigger maneuver is performed by a non-casual agent behind the target vehicle. As a side-effect, our analysis reveals interesting limitations within trajectory prediction models. Finally, we evaluate a range of defenses against backdoors. While some, like simple offroad checks, do not enable detection for all triggers, clustering is a promising candidate to support manual inspection to find backdoors.
LGOct 24, 2025
Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer ReviewsLuca Demetrio, Giovanni Apruzzese, Kathrin Grosse et al.
How does the progressive embracement of Large Language Models (LLMs) affect scientific peer reviewing? This multifaceted question is fundamental to the effectiveness -- as well as to the integrity -- of the scientific process. Recent evidence suggests that LLMs may have already been tacitly used in peer reviewing, e.g., at the 2024 International Conference of Learning Representations (ICLR). Furthermore, some efforts have been undertaken in an attempt to explicitly integrate LLMs in peer reviewing by various editorial boards (including that of ICLR'25). To fully understand the utility and the implications of LLMs' deployment for scientific reviewing, a comprehensive relevant dataset is strongly desirable. Despite some previous research on this topic, such dataset has been lacking so far. We fill in this gap by presenting GenReview, the hitherto largest dataset containing LLM-written reviews. Our dataset includes 81K reviews generated for all submissions to the 2018--2025 editions of the ICLR by providing the LLM with three independent prompts: a negative, a positive, and a neutral one. GenReview is also linked to the respective papers and their original reviews, thereby enabling a broad range of investigations. To illustrate the value of GenReview, we explore a sample of intriguing research questions, namely: if LLMs exhibit bias in reviewing (they do); if LLM-written reviews can be automatically detected (so far, they can); if LLMs can rigorously follow reviewing instructions (not always) and whether LLM-provided ratings align with decisions on paper acceptance or rejection (holds true only for accepted papers). GenReview can be accessed at the following link: https://anonymous.4open.science/r/gen_review.
CRAug 29, 2025
I Stolenly Swear That I Am Up to (No) Good: Design and Evaluation of Model Stealing AttacksDaryna Oliynyk, Rudolf Mayer, Kathrin Grosse et al.
Model stealing attacks endanger the confidentiality of machine learning models offered as a service. Although these models are kept secret, a malicious party can query a model to label data samples and train their own substitute model, violating intellectual property. While novel attacks in the field are continually being published, their design and evaluations are not standardised, making it challenging to compare prior works and assess progress in the field. This paper is the first to address this gap by providing recommendations for designing and evaluating model stealing attacks. To this end, we study the largest group of attacks that rely on training a substitute model -- those attacking image classification models. We propose the first comprehensive threat model and develop a framework for attack comparison. Further, we analyse attack setups from related works to understand which tasks and models have been studied the most. Based on our findings, we present best practices for attack development before, during, and beyond experiments and derive an extensive list of open research questions regarding the evaluation of model stealing attacks. Our findings and recommendations also transfer to other problem domains, hence establishing the first generic evaluation methodology for model stealing attacks.
LGJun 14, 2021
Backdoor Learning Curves: Explaining Backdoor Poisoning Beyond Influence FunctionsAntonio Emanuele Cinà, Kathrin Grosse, Sebastiano Vascon et al.
Backdoor attacks inject poisoning samples during training, with the goal of forcing a machine learning model to output an attacker-chosen class when presented a specific trigger at test time. Although backdoor attacks have been demonstrated in a variety of settings and against different models, the factors affecting their effectiveness are still not well understood. In this work, we provide a unifying framework to study the process of backdoor learning under the lens of incremental learning and influence functions. We show that the effectiveness of backdoor attacks depends on: (i) the complexity of the learning algorithm, controlled by its hyperparameters; (ii) the fraction of backdoor samples injected into the training set; and (iii) the size and visibility of the backdoor trigger. These factors affect how fast a model learns to correlate the presence of the backdoor trigger with the target class. Our analysis unveils the intriguing existence of a region in the hyperparameter space in which the accuracy on clean test samples is still high while backdoor attacks are ineffective, thereby suggesting novel criteria to improve existing defenses.
CRMay 8, 2021
Mental Models of Adversarial Machine LearningLukas Bieringer, Kathrin Grosse, Michael Backes et al.
Although machine learning is widely used in practice, little is known about practitioners' understanding of potential security challenges. In this work, we close this substantial gap and contribute a qualitative study focusing on developers' mental models of the machine learning pipeline and potentially vulnerable components. Similar studies have helped in other security fields to discover root causes or improve risk communication. Our study reveals two \facets of practitioners' mental models of machine learning security. Firstly, practitioners often confuse machine learning security with threats and defences that are not directly related to machine learning. Secondly, in contrast to most academic research, our participants perceive security of machine learning as not solely related to individual models, but rather in the context of entire workflows that consist of multiple components. Jointly with our additional findings, these two facets provide a foundation to substantiate mental models for machine learning security and have implications for the integration of adversarial machine learning into corporate workflows, \new{decreasing practitioners' reported uncertainty}, and appropriate regulatory frameworks for machine learning security.
CRJul 14, 2020
Adversarial Examples and MetricsNico Döttling, Kathrin Grosse, Michael Backes et al.
Adversarial examples are a type of attack on machine learning (ML) systems which cause misclassification of inputs. Achieving robustness against adversarial examples is crucial to apply ML in the real world. While most prior work on adversarial examples is empirical, a recent line of work establishes fundamental limitations of robust classification based on cryptographic hardness. Most positive and negative results in this field however assume that there is a fixed target metric which constrains the adversary, and we argue that this is often an unrealistic assumption. In this work we study the limitations of robust classification if the target metric is uncertain. Concretely, we construct a classification problem, which admits robust classification by a small classifier if the target metric is known at the time the model is trained, but for which robust classification is impossible for small classifiers if the target metric is chosen after the fact. In the process, we explore a novel connection between hardness of robust classification and bounded storage model cryptography.
LGJun 12, 2020
How many winning tickets are there in one DNN?Kathrin Grosse, Michael Backes
The recent lottery ticket hypothesis proposes that there is one sub-network that matches the accuracy of the original network when trained in isolation. We show that instead each network contains several winning tickets, even if the initial weights are fixed. The resulting winning sub-networks are not instances of the same network under weight space symmetry, and show no overlap or correlation significantly larger than expected by chance. If randomness during training is decreased, overlaps higher than chance occur, even if the networks are trained on different tasks. We conclude that there is rather a distribution over capable sub-networks, as opposed to a single winning ticket.
LGJun 11, 2020
Backdoor Smoothing: Demystifying Backdoor Attacks on Deep Neural NetworksKathrin Grosse, Taesung Lee, Battista Biggio et al.
Backdoor attacks mislead machine-learning models to output an attacker-specified class when presented a specific trigger at test time. These attacks require poisoning the training data to compromise the learning algorithm, e.g., by injecting poisoning samples containing the trigger into the training set, along with the desired class label. Despite the increasing number of studies on backdoor attacks and defenses, the underlying factors affecting the success of backdoor attacks, along with their impact on the learning algorithm, are not yet well understood. In this work, we aim to shed light on this issue by unveiling that backdoor attacks induce a smoother decision function around the triggered samples -- a phenomenon which we refer to as \textit{backdoor smoothing}. To quantify backdoor smoothing, we define a measure that evaluates the uncertainty associated to the predictions of a classifier around the input samples. Our experiments show that smoothness increases when the trigger is added to the input samples, and that this phenomenon is more pronounced for more successful attacks. We also provide preliminary evidence that backdoor triggers are not the only smoothing-inducing patterns, but that also other artificial patterns can be detected by our approach, paving the way towards understanding the limitations of current defenses and designing novel ones.
CRSep 19, 2019
Adversarial Vulnerability Bounds for Gaussian Process ClassificationMichael Thomas Smith, Kathrin Grosse, Michael Backes et al.
Machine learning (ML) classification is increasingly used in safety-critical systems. Protecting ML classifiers from adversarial examples is crucial. We propose that the main threat is that of an attacker perturbing a confidently classified input to produce a confident misclassification. To protect against this we devise an adversarial bound (AB) for a Gaussian process classifier, that holds for the entire input domain, bounding the potential for any future adversarial method to cause such misclassification. This is a formal guarantee of robustness, not just an empirically derived result. We investigate how to configure the classifier to maximise the bound, including the use of a sparse approximation, leading to the method producing a practical, useful and provably robust classifier, which we test using a variety of datasets.
CRFeb 8, 2019
On the security relevance of weights in deep learningKathrin Grosse, Thomas A. Trost, Marius Mosbach et al.
Recently, a weight-based attack on stochastic gradient descent inducing overfitting has been proposed. We show that the threat is broader: A task-independent permutation on the initial weights suffices to limit the achieved accuracy to for example 50% on the Fashion MNIST dataset from initially more than $90$%. These findings are confirmed on MNIST and CIFAR. We formally confirm that the attack succeeds with high likelihood and does not depend on the data. Empirically, weight statistics and loss appear unsuspicious, making it hard to detect the attack if the user is not aware. Our paper is thus a call for action to acknowledge the importance of the initial weights in deep learning.
CRDec 6, 2018
The Limitations of Model Uncertainty in Adversarial SettingsKathrin Grosse, David Pfaff, Michael Thomas Smith et al.
Machine learning models are vulnerable to adversarial examples: minor perturbations to input samples intended to deliberately cause misclassification. While an obvious security threat, adversarial examples yield as well insights about the applied model itself. We investigate adversarial examples in the context of Bayesian neural network's (BNN's) uncertainty measures. As these measures are highly non-smooth, we use a smooth Gaussian process classifier (GPC) as substitute. We show that both confidence and uncertainty can be unsuspicious even if the output is wrong. Intriguingly, we find subtle differences in the features influencing uncertainty and confidence for most tasks.
CRAug 1, 2018
MLCapsule: Guarded Offline Deployment of Machine Learning as a ServiceLucjan Hanzlik, Yang Zhang, Kathrin Grosse et al.
With the widespread use of machine learning (ML) techniques, ML as a service has become increasingly popular. In this setting, an ML model resides on a server and users can query it with their data via an API. However, if the user's input is sensitive, sending it to the server is undesirable and sometimes even legally not possible. Equally, the service provider does not want to share the model by sending it to the client for protecting its intellectual property and pay-per-query business model. In this paper, we propose MLCapsule, a guarded offline deployment of machine learning as a service. MLCapsule executes the model locally on the user's side and therefore the data never leaves the client. Meanwhile, MLCapsule offers the service provider the same level of control and security of its model as the commonly used server-side execution. In addition, MLCapsule is applicable to offline applications that require local execution. Beyond protecting against direct model access, we couple the secure offline deployment with defenses against advanced attacks on machine learning models such as model stealing, reverse engineering, and membership inference.
CRJun 6, 2018
Killing four birds with one Gaussian process: the relation between different test-time attacksKathrin Grosse, Michael T. Smith, Michael Backes
In machine learning (ML) security, attacks like evasion, model stealing or membership inference are generally studied in individually. Previous work has also shown a relationship between some attacks and decision function curvature of the targeted model. Consequently, we study an ML model allowing direct control over the decision surface curvature: Gaussian Process classifiers (GPCs). For evasion, we find that changing GPC's curvature to be robust against one attack algorithm boils down to enabling a different norm or attack algorithm to succeed. This is backed up by our formal analysis showing that static security guarantees are opposed to learning. Concerning intellectual property, we show formally that lazy learning does not necessarily leak all information when applied. In practice, often a seemingly secure curvature can be found. For example, we are able to secure GPC against empirical membership inference by proper configuration. In this configuration, however, the GPC's hyper-parameters are leaked, e.g. model reverse engineering succeeds. We conclude that attacks on classification should not be studied in isolation, but in relation to each other.
CRNov 17, 2017
How Wrong Am I? - Studying Adversarial Examples and their Impact on Uncertainty in Gaussian Process Machine Learning ModelsKathrin Grosse, David Pfaff, Michael Thomas Smith et al.
Machine learning models are vulnerable to Adversarial Examples: minor perturbations to input samples intended to deliberately cause misclassification. Current defenses against adversarial examples, especially for Deep Neural Networks (DNN), are primarily derived from empirical developments, and their security guarantees are often only justified retroactively. Many defenses therefore rely on hidden assumptions that are subsequently subverted by increasingly elaborate attacks. This is not surprising: deep learning notoriously lacks a comprehensive mathematical framework to provide meaningful guarantees. In this paper, we leverage Gaussian Processes to investigate adversarial examples in the framework of Bayesian inference. Across different models and datasets, we find deviating levels of uncertainty reflect the perturbation introduced to benign samples by state-of-the-art attacks, including novel white-box attacks on Gaussian Processes. Our experiments demonstrate that even unoptimized uncertainty thresholds already reject adversarial examples in many scenarios. Comment: Thresholds can be broken in a modified attack, which was done in arXiv:1812.02606 (The limitations of model uncertainty in adversarial settings).
CRFeb 21, 2017
On the (Statistical) Detection of Adversarial ExamplesKathrin Grosse, Praveen Manoharan, Nicolas Papernot et al.
Machine Learning (ML) models are applied in a variety of tasks such as network intrusion detection or Malware classification. Yet, these models are vulnerable to a class of malicious inputs known as adversarial examples. These are slightly perturbed inputs that are classified incorrectly by the ML model. The mitigation of these adversarial inputs remains an open problem. As a step towards understanding adversarial examples, we show that they are not drawn from the same distribution than the original data, and can thus be detected using statistical tests. Using thus knowledge, we introduce a complimentary approach to identify specific inputs that are adversarial. Specifically, we augment our ML model with an additional output, in which the model is trained to classify all adversarial inputs. We evaluate our approach on multiple adversarial example crafting methods (including the fast gradient sign and saliency map methods) with several datasets. The statistical test flags sample sets containing adversarial inputs confidently at sample sizes between 10 and 100 data points. Furthermore, our augmented model either detects adversarial examples as outliers with high accuracy (> 80%) or increases the adversary's cost - the perturbation added - by more than 150%. In this way, we show that statistical properties of adversarial examples are essential to their detection.
CRJun 14, 2016
Adversarial Perturbations Against Deep Neural Networks for Malware ClassificationKathrin Grosse, Nicolas Papernot, Praveen Manoharan et al.
Deep neural networks, like many other machine learning models, have recently been shown to lack robustness against adversarially crafted inputs. These inputs are derived from regular inputs by minor yet carefully selected perturbations that deceive machine learning models into desired misclassifications. Existing work in this emerging field was largely specific to the domain of image classification, since the high-entropy of images can be conveniently manipulated without changing the images' overall visual appearance. Yet, it remains unclear how such attacks translate to more security-sensitive applications such as malware detection - which may pose significant challenges in sample generation and arguably grave consequences for failure. In this paper, we show how to construct highly-effective adversarial sample crafting attacks for neural networks used as malware classifiers. The application domain of malware classification introduces additional constraints in the adversarial sample crafting problem when compared to the computer vision domain: (i) continuous, differentiable input domains are replaced by discrete, often binary inputs; and (ii) the loose condition of leaving visual appearance unchanged is replaced by requiring equivalent functional behavior. We demonstrate the feasibility of these attacks on many different instances of malware classifiers that we trained using the DREBIN Android malware data set. We furthermore evaluate to which extent potential defensive mechanisms against adversarial crafting can be leveraged to the setting of malware classification. While feature reduction did not prove to have a positive impact, distillation and re-training on adversarially crafted samples show promising results.