AISep 26, 2024
Autonomous Network Defence using Reinforcement LearningMyles Foley, Chris Hicks, Kate Highnam et al.
In the network security arms race, the defender is significantly disadvantaged as they need to successfully detect and counter every malicious attack. In contrast, the attacker needs to succeed only once. To level the playing field, we investigate the effectiveness of autonomous agents in a realistic network defence scenario. We first outline the problem, provide the background on reinforcement learning and detail our proposed agent design. Using a network environment simulation, with 13 hosts spanning 3 subnets, we train a novel reinforcement learning agent and show that it can reliably defend continual attacks by two advanced persistent threat (APT) red agents: one with complete knowledge of the network layout and another which must discover resources through exploration but is more general.
LGOct 20, 2023
Reward Shaping for Happier Autonomous Cyber Security AgentsElizabeth Bates, Vasilios Mavroudis, Chris Hicks
As machine learning models become more capable, they have exhibited increased potential in solving complex tasks. One of the most promising directions uses deep reinforcement learning to train autonomous agents in computer network defense tasks. This work studies the impact of the reward signal that is provided to the agents when training for this task. Due to the nature of cybersecurity tasks, the reward signal is typically 1) in the form of penalties (e.g., when a compromise occurs), and 2) distributed sparsely across each defense episode. Such reward characteristics are atypical of classic reinforcement learning tasks where the agent is regularly rewarded for progress (cf. to getting occasionally penalized for failures). We investigate reward shaping techniques that could bridge this gap so as to enable agents to train more sample-efficiently and potentially converge to a better performance. We first show that deep reinforcement learning algorithms are sensitive to the magnitude of the penalties and their relative size. Then, we combine penalties with positive external rewards and study their effect compared to penalty-only training. Finally, we evaluate intrinsic curiosity as an internal positive reward mechanism and discuss why it might not be as advantageous for high-level network monitoring tasks.
CRJun 15, 2023
Inroads into Autonomous Network Defence using Explained Reinforcement LearningMyles Foley, Mia Wang, Zoe M et al.
Computer network defence is a complicated task that has necessitated a high degree of human involvement. However, with recent advancements in machine learning, fully autonomous network defence is becoming increasingly plausible. This paper introduces an end-to-end methodology for studying attack strategies, designing defence agents and explaining their operation. First, using state diagrams, we visualise adversarial behaviour to gain insight about potential points of intervention and inform the design of our defensive models. We opt to use a set of deep reinforcement learning agents trained on different parts of the task and organised in a shallow hierarchy. Our evaluation shows that the resulting design achieves a substantial performance improvement compared to prior work. Finally, to better investigate the decision-making process of our agents, we complete our analysis with a feature ablation and importance study.
CRMay 21
Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is HardSahar Abdelnabi, Chris Hicks, Konrad Rieck et al.
The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.
CRApr 9
Building Better Environments for Autonomous Cyber DefenceChris Hicks, Elizabeth Bates, Shae McFadden et al.
In November 2025, the authors ran a workshop on the topic of what makes a good reinforcement learning (RL) environment for autonomous cyber defence (ACD). This paper details the knowledge shared by participants both during the workshop and shortly afterwards by contributing herein. The workshop participants come from academia, industry, and government, and have extensive hands-on experience designing and working with RL and cyber environments. While there is now a sizeable body of literature describing work in RL for ACD, there is nevertheless a great deal of tradecraft, domain knowledge, and common hazards which are not detailed comprehensively in a single resource. With a specific focus on building better environments to train and evaluate autonomous RL agents in network defence scenarios, including government and critical infrastructure networks, the contributions of this work are twofold: (1) a framework for decomposing the interface between RL cyber environments and real systems, and (2) guidelines on current best practice for RL-based ACD environment development and agent evaluation, based on the key findings from our workshop.
LGJun 23, 2023
Nearest Neighbour with Bandit FeedbackStephen Pasteris, Chris Hicks, Vasilios Mavroudis
In this paper we adapt the nearest neighbour rule to the contextual bandit problem. Our algorithm handles the fully adversarial setting in which no assumptions at all are made about the data-generation process. When combined with a sufficiently fast data-structure for (perhaps approximate) adaptive nearest neighbour search, such as a navigating net, our algorithm is extremely efficient - having a per trial running time polylogarithmic in both the number of trials and actions, and taking only quasi-linear space. We give generic regret bounds for our algorithm and further analyse them when applied to the stochastic bandit problem in euclidean space. We note that our algorithm can also be applied to the online classification problem.
LGJul 21, 2024
Mitigating Deep Reinforcement Learning Backdoors in the Neural Activation SpaceSanyam Vyas, Chris Hicks, Vasilios Mavroudis
This paper investigates the threat of backdoors in Deep Reinforcement Learning (DRL) agent policies and proposes a novel method for their detection at runtime. Our study focuses on elusive in-distribution backdoor triggers. Such triggers are designed to induce a deviation in the behaviour of a backdoored agent while blending into the expected data distribution to evade detection. Through experiments conducted in the Atari Breakout environment, we demonstrate the limitations of current sanitisation methods when faced with such triggers and investigate why they present a challenging defence problem. We then evaluate the hypothesis that backdoor triggers might be easier to detect in the neural activation space of the DRL agent's policy network. Our statistical analysis shows that indeed the activation patterns in the agent's policy network are distinct in the presence of a trigger, regardless of how well the trigger is concealed in the environment. Based on this, we propose a new defence approach that uses a classifier trained on clean environment samples and detects abnormal activations. Our results show that even lightweight classifiers can effectively prevent malicious actions with considerable accuracy, indicating the potential of this research direction even against sophisticated adversaries.
GTAug 4, 2024
Environment Complexity and Nash Equilibria in a Sequential Social DilemmaMustafa Yasir, Andrew Howes, Vasilios Mavroudis et al.
Multi-agent reinforcement learning (MARL) methods, while effective in zero-sum or positive-sum games, often yield suboptimal outcomes in general-sum games where cooperation is essential for achieving globally optimal outcomes. Matrix game social dilemmas, which abstract key aspects of general-sum interactions, such as cooperation, risk, and trust, fail to model the temporal and spatial dynamics characteristic of real-world scenarios. In response, our study extends matrix game social dilemmas into more complex, higher-dimensional MARL environments. We adapt a gridworld implementation of the Stag Hunt dilemma to more closely match the decision-space of a one-shot matrix game while also introducing variable environment complexity. Our findings indicate that as complexity increases, MARL agents trained in these environments converge to suboptimal strategies, consistent with the risk-dominant Nash equilibria strategies found in matrix games. Our work highlights the impact of environment complexity on achieving optimal outcomes in higher-dimensional game-theoretic MARL environments.
CRMar 19
On The Effectiveness of the UK NIS Regulations as a Mandatory Cybersecurity Reporting RegimeJunade Ali, Chris Hicks
Existing cybersecurity literature lacks a source of empirical, representative data as to the true nature of cyberattacks on Critical National Infrastructure. We have obtained UK-wide data on incidents reported under the Network and Information Systems (NIS) Regulations in 2024 causing "a significant impact on the continuity" of essential services and comparator data from intelligence agencies. We find that 29% of NIS reports already concern cybersecurity incidents. As the UK Government seeks to extend cybersecurity reporting, we find the NIS Regulations are limited in their effectiveness; whilst our requests revealed 30 cybersecurity incidents reported under the NIS regulations, there were 89 incidents classified as "highly significant and significant" captured by the National Cyber Security Centre in the 2024 reporting year. Whereas 36% of Cybersecurity and Infrastructure Security Agency reported attacks concerned espionage, from NIS data we find 100% NIS-reportable cyberattacks concerning healthcare systems in England in 2024 were ransomware.
LGFeb 4
Beyond Rewards in Reinforcement Learning for Cyber DefenceElizabeth Bates, Chris Hicks, Vasilios Mavroudis
Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.
LGFeb 9
SoK: The Pitfalls of Deep Reinforcement Learning for CybersecurityShae McFadden, Myles Foley, Elizabeth Bates et al.
Deep Reinforcement Learning (DRL) has achieved remarkable success in domains requiring sequential decision-making, motivating its application to cybersecurity problems. However, transitioning DRL from laboratory simulations to bespoke cyber environments can introduce numerous issues. This is further exacerbated by the often adversarial, non-stationary, and partially-observable nature of most cybersecurity tasks. In this paper, we identify and systematize 11 methodological pitfalls that frequently occur in DRL for cybersecurity (DRL4Sec) literature across the stages of environment modeling, agent training, performance evaluation, and system deployment. By analyzing 66 significant DRL4Sec papers (2018-2025), we quantify the prevalence of each pitfall and find an average of over five pitfalls per paper. We demonstrate the practical impact of these pitfalls using controlled experiments in (i) autonomous cyber defense, (ii) adversarial malware creation, and (iii) web security testing environments. Finally, we provide actionable recommendations for each pitfall to support the development of more rigorous and deployable DRL-based security systems.
SEDec 15, 2024Code
SoK: On Closing the Applicability Gap in Automated Vulnerability DetectionEzzeldin Shereen, Dan Ristea, Sanyam Vyas et al.
The frequent discovery of security vulnerabilities in both open-source and proprietary software underscores the urgent need for earlier detection during the development lifecycle. Initiatives such as DARPA's Artificial Intelligence Cyber Challenge (AIxCC) aim to accelerate Automated Vulnerability Detection (AVD), seeking to address this challenge by autonomously analyzing source code to identify vulnerabilities. This paper addresses two primary research questions: (RQ1) How is current AVD research distributed across its core components? (RQ2) What key areas should future research target to bridge the gap in the practical applicability of AVD throughout software development? To answer these questions, we conduct a systematization over 79 AVD articles and 17 empirical studies, analyzing them across five core components: task formulation and granularity, input programming languages and representations, detection approaches and key solutions, evaluation metrics and datasets, and reported performance. Our systematization reveals that the narrow focus of AVD research-mainly on specific tasks and programming languages-limits its practical impact and overlooks broader areas crucial for effective, real-world vulnerability detection. We identify significant challenges, including the need for diversified problem formulations, varied detection granularities, broader language support, better dataset quality, enhanced reproducibility, and increased practical impact. Based on these findings we identify research directions that will enhance the effectiveness and applicability of AVD solutions in software security.
LGNov 14, 2024Code
Inherently Interpretable and Uncertainty-Aware Models for Online Learning in Cyber-Security ProblemsBenjamin Kolicic, Alberto Caron, Chris Hicks et al.
In this paper, we address the critical need for interpretable and uncertainty-aware machine learning models in the context of online learning for high-risk industries, particularly cyber-security. While deep learning and other complex models have demonstrated impressive predictive capabilities, their opacity and lack of uncertainty quantification present significant questions about their trustworthiness. We propose a novel pipeline for online supervised learning problems in cyber-security, that harnesses the inherent interpretability and uncertainty awareness of Additive Gaussian Processes (AGPs) models. Our approach aims to balance predictive performance with transparency while improving the scalability of AGPs, which represents their main drawback, potentially enabling security analysts to better validate threat detection, troubleshoot and reduce false positives, and generally make trustworthy, informed decisions. This work contributes to the growing field of interpretable AI by proposing a class of models that can be significantly beneficial for high-stake decision problems such as the ones typical of the cyber-security domain. The source code is available.
CRDec 10, 2021Code
An Interface between Legacy and Modern Mobile Devices for Digital IdentityVasilios Mavroudis, Chris Hicks, Jon Crowcroft
In developing regions a substantial number of users rely on legacy and ultra-low-cost mobile devices. Unfortunately, many of these devices are not equipped to run the standard authentication or identity apps that are available for smartphones. Increasingly, apps that display Quick Response (QR) codes are being used to communicate personal credentials (e.g., Covid-19 vaccination certificates). This paper describes a novel interface for QR code credentials that is compatible with legacy mobile devices. Our solution, which we have released under open source licensing, allows Web Application Enabled legacy mobile devices to load and display standard QR codes. This technique makes modern identity platforms available to previously excluded and economically disadvantaged populations.
CRJun 25, 2020Code
Differentially Private Health Tokens for Estimating COVID-19 RiskDavid Butler, Chris Hicks, James Bell et al.
In the fight against Covid-19, many governments and businesses are in the process of evaluating, trialling and even implementing so-called immunity passports. Also known as antibody or health certificates, there is a clear demand for any technology that could allow people to return to work and other crowded places without placing others at risk. One of the major criticisms of such systems is that they could be misused to unfairly discriminate against those without immunity, allowing the formation of an `immuno-privileged' class of people. In this work we are motivated to explore an alternative technical solution that is non-discriminatory by design. In particular we propose health tokens -- randomised health certificates which, using methods from differential privacy, allow individual test results to be randomised whilst still allowing useful aggregate risk estimates to be calculated. We show that health tokens could mitigate immunity-based discrimination whilst still presenting a viable mechanism for estimating the collective transmission risk posed by small groups of users. We evaluate the viability of our approach in the context of identity-free and identity-binding use cases and then consider a number of possible attacks. Our experimental results show that for groups of size 500 or more, the error associated with our method can be as low as 0.03 on average and thus the aggregated results can be useful in a number of identity-free contexts. Finally, we present the results of our open-source prototype which demonstrates the practicality of our solution.
CRDec 8, 2023
Canaries and Whistles: Resilient Drone Communication Networks with (or without) Deep Reinforcement LearningChris Hicks, Vasilios Mavroudis, Myles Foley et al.
Communication networks able to withstand hostile environments are critically important for disaster relief operations. In this paper, we consider a challenging scenario where drones have been compromised in the supply chain, during their manufacture, and harbour malicious software capable of wide-ranging and infectious disruption. We investigate multi-agent deep reinforcement learning as a tool for learning defensive strategies that maximise communications bandwidth despite continual adversarial interference. Using a public challenge for learning network resilience strategies, we propose a state-of-the-art expert technique and study its superiority over deep reinforcement learning agents. Correspondingly, we identify three specific methods for improving the performance of our learning-based agents: (1) ensuring each observation contains the necessary information, (2) using expert agents to provide a curriculum for learning, and (3) paying close attention to reward. We apply our methods and present a new mixed strategy enabling expert and learning-based agents to work together and improve on all prior results.
LGOct 8, 2025
Poisoning Attacks on LLMs Require a Near-constant Number of Poison SamplesAlexandra Souly, Javier Rando, Ed Chapman et al.
Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
LGOct 23, 2024
Entity-based Reinforcement Learning for Autonomous Cyber DefenceIsaac Symes Thompson, Alberto Caron, Chris Hicks et al.
A significant challenge for autonomous cyber defence is ensuring a defensive agent's ability to generalise across diverse network topologies and configurations. This capability is necessary for agents to remain effective when deployed in dynamically changing environments, such as an enterprise network where devices may frequently join and leave. Standard approaches to deep reinforcement learning, where policies are parameterised using a fixed-input multi-layer perceptron (MLP) expect fixed-size observation and action spaces. In autonomous cyber defence, this makes it hard to develop agents that generalise to environments with network topologies different from those trained on, as the number of nodes affects the natural size of the observation and action spaces. To overcome this limitation, we reframe the problem of autonomous network defence using entity-based reinforcement learning, where the observation and action space of an agent are decomposed into a collection of discrete entities. This framework enables the use of policy parameterisations specialised in compositional generalisation. We train a Transformer-based policy on the Yawning Titan cyber-security simulation environment and test its generalisation capabilities across various network topologies. We demonstrate that this approach significantly outperforms an MLP-based policy when training across fixed-size networks of varying topologies, and matches performance when training on a single network. We also demonstrate the potential for zero-shot generalisation to networks of a different size to those seen in training. These findings highlight the potential for entity-based reinforcement learning to advance the field of autonomous cyber defence by providing more generalisable policies capable of handling variations in real-world network environments.
CROct 18, 2024
CybORG++: An Enhanced Gym for the Development of Autonomous Cyber AgentsHarry Emerson, Liz Bates, Chris Hicks et al.
CybORG++ is an advanced toolkit for reinforcement learning research focused on network defence. Building on the CAGE 2 CybORG environment, it introduces key improvements, including enhanced debugging capabilities, refined agent implementation support, and a streamlined environment that enables faster training and easier customisation. Along with addressing several software bugs from its predecessor, CybORG++ introduces MiniCAGE, a lightweight version of CAGE 2, which improves performance dramatically, up to 1000x faster execution in parallel iterations, without sacrificing accuracy or core functionality. CybORG++ serves as a robust platform for developing and evaluating defensive agents, making it a valuable resource for advancing enterprise network defence research.
LGMay 5, 2024
A View on Out-of-Distribution Identification from a Statistical Testing Theory PerspectiveAlberto Caron, Chris Hicks, Vasilios Mavroudis
We study the problem of efficiently detecting Out-of-Distribution (OOD) samples at test time in supervised and unsupervised learning contexts. While ML models are typically trained under the assumption that training and test data stem from the same distribution, this is often not the case in realistic settings, thus reliably detecting distribution shifts is crucial at deployment. We re-formulate the OOD problem under the lenses of statistical testing and then discuss conditions that render the OOD problem identifiable in statistical terms. Building on this framework, we study convergence guarantees of an OOD test based on the Wasserstein distance, and provide a simple empirical evaluation.
LGAug 26, 2025
DRMD: Deep Reinforcement Learning for Malware Detection under Concept DriftShae McFadden, Myles Foley, Mario D'Onghia et al.
Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent, simultaneously optimizing sample classification performance and rejecting high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved an average AUT improvement of 8.66 and 10.90 for the classification-only and classification-rejection policies, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resiliency to concept drift in the dynamic setting of Android malware detection.
CLApr 2, 2025
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single ImageEzzeldin Shereen, Dan Ristea, Shae McFadden et al.
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrating robustness to black-box attacks in the universal setting.
LGJan 24, 2025
An Attentive Graph Agent for Topology-Adaptive Cyber DefenceIlya Orson Sandoval, Isaac Symes Thompson, Vasilios Mavroudis et al.
As cyber threats grow increasingly sophisticated, reinforcement learning (RL) is emerging as a promising technique to create intelligent and adaptive cyber defense systems. However, most existing autonomous defensive agents have overlooked the inherent graph structure of computer networks subject to cyber attacks, potentially missing critical information and constraining their adaptability. To overcome these limitations, we developed a custom version of the Cyber Operations Research Gym (CybORG) environment, encoding network state as a directed graph with realistic low-level features. We employ a Graph Attention Network (GAT) architecture to process node, edge, and global features, and adapt its output to be compatible with policy gradient methods in RL. Our GAT-based approach offers key advantages over flattened alternatives: policies that demonstrate resilience to certain types of unexpected dynamic network topology changes, reasonable generalisation to networks of varying sizes within the same structural distribution, and interpretable defensive actions grounded in tangible network properties. We demonstrate that GAT defensive policies can be trained using our low-level directed graph observations, even when unexpected connections arise during simulation. Evaluations across networks of different sizes, but consistent subnetwork structure, show our policies achieve comparable performance to policies trained specifically for each network configuration. Our study contributes to the development of robust cyber defence systems that can better adapt to real-world network security challenges.
LGDec 14, 2023
A Hierarchical Nearest Neighbour Approach to Contextual BanditsStephen Pasteris, Chris Hicks, Vasilios Mavroudis
In this paper we consider the adversarial contextual bandit problem in metric spaces. The paper "Nearest neighbour with bandit feedback" tackled this problem but when there are many contexts near the decision boundary of the comparator policy it suffers from a high regret. In this paper we eradicate this problem, designing an algorithm in which we can hold out any set of contexts when computing our regret term. Our algorithm builds on that of "Nearest neighbour with bandit feedback" and hence inherits its extreme computational efficiency.
LGJul 7, 2025
Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement LearningSanyam Vyas, Alberto Caron, Chris Hicks et al.
Deep Reinforcement Learning (DRL) systems are increasingly used in safety-critical applications, yet their security remains severely underexplored. This work investigates backdoor attacks, which implant hidden triggers that cause malicious actions only when specific inputs appear in the observation space. Existing DRL backdoor research focuses solely on training-time attacks requiring unrealistic access to the training pipeline. In contrast, we reveal critical vulnerabilities across the DRL supply chain where backdoors can be embedded with significantly reduced adversarial privileges. We introduce two novel attacks: (1) TrojanentRL, which exploits component-level flaws to implant a persistent backdoor that survives full model retraining; and (2) InfrectroRL, a post-training backdoor attack which requires no access to training, validation, nor test data. Empirical and analytical evaluations across six Atari environments show our attacks rival state-of-the-art training-time backdoor attacks while operating under much stricter adversarial constraints. We also demonstrate that InfrectroRL further evades two leading DRL backdoor defenses. These findings challenge the current research focus and highlight the urgent need for robust defenses.
LGJul 3, 2025
On Efficient Bayesian Exploration in Model-Based Reinforcement LearningAlberto Caron, Chris Hicks, Vasilios Mavroudis
In this work, we address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. Specifically, we focus on a class of exploration bonuses that targets epistemic uncertainty rather than the aleatoric noise inherent in the environment. We prove that these bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment's dynamics and rewards, thereby aligning exploration with genuine knowledge gaps. Our analysis provides formal guarantees for IG-based approaches, which previously lacked theoretical grounding. To enable practical use, we also discuss tractable approximations via sparse variational Gaussian Processes, Deep Kernels and Deep Ensemble models. We then outline a general framework - Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) - which integrates model-based planning with information-theoretic bonuses to achieve sample-efficient deep exploration. We empirically demonstrate that PTS-BE substantially outperforms other baselines across a variety of environments characterized by sparse rewards and/or purely exploratory tasks.
LGMar 12, 2025
Towards Causal Model-Based Policy OptimizationAlberto Caron, Vasilios Mavroudis, Chris Hicks
Real-world decision-making problems are often marked by complex, uncertain dynamics that can shift or break under changing conditions. Traditional Model-Based Reinforcement Learning (MBRL) approaches learn predictive models of environment dynamics from queried trajectories and then use these models to simulate rollouts for policy optimization. However, such methods do not account for the underlying causal mechanisms that govern the environment, and thus inadvertently capture spurious correlations, making them sensitive to distributional shifts and limiting their ability to generalize. The same naturally holds for model-free approaches. In this work, we introduce Causal Model-Based Policy Optimization (C-MBPO), a novel framework that integrates causal learning into the MBRL pipeline to achieve more robust, explainable, and generalizable policy learning algorithms. Our approach centers on first inferring a Causal Markov Decision Process (C-MDP) by learning a local Structural Causal Model (SCM) of both the state and reward transition dynamics from trajectories gathered online. C-MDPs differ from classic MDPs in that we can decompose causal dependencies in the environment dynamics via specifying an associated Causal Bayesian Network. C-MDPs allow for targeted interventions and counterfactual reasoning, enabling the agent to distinguish between mere statistical correlations and causal relationships. The learned SCM is then used to simulate counterfactual on-policy transitions and rewards under hypothetical actions (or ``interventions"), thereby guiding policy optimization more effectively. The resulting policy learned by C-MBPO can be shown to be robust to a class of distributional shifts that affect spurious, non-causal relationships in the dynamics. We demonstrate this through some simple experiments involving near and far OOD dynamics drifts.
LGMar 5, 2025
Less is more? Rewards in RL for Cyber DefenceElizabeth Bates, Chris Hicks, Vasilios Mavroudis
The last few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning. Such agents are typically trained in a cyber gym environment, also known as a cyber simulator, at least 32 of which have already been built. Most, if not all cyber gyms provide dense "scaffolded" reward functions which combine many penalties or incentives for a range of (un)desirable states and costly actions. Whilst dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps; they are also known to bias the solutions an agent can find, potentially towards suboptimal solutions. This is especially a problem in complex cyber environments where policy weaknesses may not be noticed until exploited by an adversary. In this work we set out to evaluate whether sparse reward functions might enable training more effective cyber defence agents. Towards this goal we first break down several evaluation limitations in existing work by proposing a ground truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents. By adapting a well-established cyber gym to accommodate our methodology and ground truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward. Our evaluation considers a range of network sizes, from 2 to 50 nodes, and both reactive and proactive defensive actions. Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents. Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.
LGNov 6, 2024
Fairness with Exponential WeightsStephen Pasteris, Chris Hicks, Vasilios Mavroudis
Motivated by the need to remove discrimination in certain applications, we develop a meta-algorithm that can convert any efficient implementation of an instance of Hedge (or equivalently, an algorithm for discrete bayesian inference) into an efficient algorithm for the equivalent contextual bandit problem which guarantees exact statistical parity on every trial. Relative to any comparator with statistical parity, the resulting algorithm has the same asymptotic regret bound as running the corresponding instance of Exp4 for each protected characteristic independently. Given that our Hedge instance admits non-stationarity we can handle a varying distribution with which to enforce statistical parity with respect to, which is useful when the true population is unknown and needs to be estimated from the data received so far. Via online-to-batch conversion we can handle the equivalent batch classification problem with exact statistical parity, giving us results that we believe are novel and important in their own right.
LGFeb 24, 2024
Extraction PropagationStephen Pasteris, Chris Hicks, Vasilios Mavroudis
Running backpropagation end to end on large neural networks is fraught with difficulties like vanishing gradients and degradation. In this paper we present an alternative architecture composed of many small neural networks that interact with one another. Instead of propagating gradients back through the architecture we propagate vector-valued messages computed via forward passes, which are then used to update the parameters. Currently the performance is conjectured as we are yet to implement the architecture. However, we do back it up with some theory. A previous version of this paper was entitled "Fusion encoder networks" and detailed a slightly different architecture.
CRMay 24, 2020
SecureABC: Secure AntiBody Certificates for COVID-19Chris Hicks, David Butler, Carsten Maple et al.
COVID-19 has resulted in unprecedented social distancing policies being enforced worldwide. As governments seek to restore their economies, open workplaces and permit travel there is a demand for technologies that may alleviate the requirement for social distancing whilst also protecting healthcare services. In this work we explore the controversial technique of so-called immunity passports and present SecureABC: a decentralised, privacy-preserving protocol for issuing and verifying antibody certificates. We consider the implications of antibody certificate systems, develop a set of risk-minimising principles and a security framework for their evaluation, and show that these may be satisfied in practice. Finally, we also develop two additional protocols that minimise individual discrimination but which still allow for collective transmission risk to be estimated. We use these two protocols to illustrate the utility-privacy trade-offs of antibody certificates and their alternatives.
CRApr 8, 2020
TraceSecure: Towards Privacy Preserving Contact TracingJames Bell, David Butler, Chris Hicks et al.
Contact tracing is being widely employed to combat the spread of COVID-19. Many apps have been developed that allow for tracing to be done automatically based off location and interaction data generated by users. There are concerns, however, regarding the privacy and security of users data when using these apps. These concerns are paramount for users who contract the virus, as they are generally required to release all their data. Motivated by the need to protect users privacy we propose two solutions to this problem. Our first solution builds on current "message based" methods and our second leverages ideas from secret sharing and additively homomorphic encryption.