Elizabeth Bates

LG
Semantic Scholar Profile
h-index8
6papers
32citations
Novelty46%
AI Score48

6 Papers

LGOct 20, 2023
Reward Shaping for Happier Autonomous Cyber Security Agents

Elizabeth Bates, Vasilios Mavroudis, Chris Hicks

As machine learning models become more capable, they have exhibited increased potential in solving complex tasks. One of the most promising directions uses deep reinforcement learning to train autonomous agents in computer network defense tasks. This work studies the impact of the reward signal that is provided to the agents when training for this task. Due to the nature of cybersecurity tasks, the reward signal is typically 1) in the form of penalties (e.g., when a compromise occurs), and 2) distributed sparsely across each defense episode. Such reward characteristics are atypical of classic reinforcement learning tasks where the agent is regularly rewarded for progress (cf. to getting occasionally penalized for failures). We investigate reward shaping techniques that could bridge this gap so as to enable agents to train more sample-efficiently and potentially converge to a better performance. We first show that deep reinforcement learning algorithms are sensitive to the magnitude of the penalties and their relative size. Then, we combine penalties with positive external rewards and study their effect compared to penalty-only training. Finally, we evaluate intrinsic curiosity as an internal positive reward mechanism and discuss why it might not be as advantageous for high-level network monitoring tasks.

23.3GRMay 6
Photons x Force: Differentiable Radiation Pressure Modeling

Charles Constant, Elizabeth Bates, Santosh Bhattarai et al.

We propose a system to optimize parametric designs subject to radiation pressure, \ie the effect of light on the motion of objects. This is most relevant in the design of spacecraft, where radiation pressure presents the dominant non-conservative forcing mechanism, which is the case beyond approximately 800 km altitude. Despite its importance, the high computational cost of high-fidelity radiation pressure modeling has limited its use in large-scale spacecraft design, optimization, and space situational awareness applications. We enable this by offering three innovations in the simulation, in representation and in optimization: First, a practical computer graphics-inspired Monte-Carlo (MC) simulation of radiation pressure. The simulation is highly parallel, uses importance sampling and next-event estimation to reduce variance and allows simulating an entire family of designs instead of a single spacecraft as in previous work. Second, we introduce neural networks as a representation of forces from design parameters. This neural proxy model, learned from simulations, is inherently differentiable and can query forces orders of magnitude faster than a full MC simulation. Third, and finally, we demonstrate optimizing inverse radiation pressure designs, such as finding geometry, material or operation parameters that minimizes travel time, maximizes proximity given a desired end-point, minimize thruster fuel, trains mission control policies or allocated compute budget in extraterrestrial compute.

69.6CRApr 9
Building Better Environments for Autonomous Cyber Defence

Chris Hicks, Elizabeth Bates, Shae McFadden et al.

In November 2025, the authors ran a workshop on the topic of what makes a good reinforcement learning (RL) environment for autonomous cyber defence (ACD). This paper details the knowledge shared by participants both during the workshop and shortly afterwards by contributing herein. The workshop participants come from academia, industry, and government, and have extensive hands-on experience designing and working with RL and cyber environments. While there is now a sizeable body of literature describing work in RL for ACD, there is nevertheless a great deal of tradecraft, domain knowledge, and common hazards which are not detailed comprehensively in a single resource. With a specific focus on building better environments to train and evaluate autonomous RL agents in network defence scenarios, including government and critical infrastructure networks, the contributions of this work are twofold: (1) a framework for decomposing the interface between RL cyber environments and real systems, and (2) guidelines on current best practice for RL-based ACD environment development and agent evaluation, based on the key findings from our workshop.

LGFeb 4
Beyond Rewards in Reinforcement Learning for Cyber Defence

Elizabeth Bates, Chris Hicks, Vasilios Mavroudis

Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows directly comparing between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.

LGFeb 9
SoK: The Pitfalls of Deep Reinforcement Learning for Cybersecurity

Shae McFadden, Myles Foley, Elizabeth Bates et al.

Deep Reinforcement Learning (DRL) has achieved remarkable success in domains requiring sequential decision-making, motivating its application to cybersecurity problems. However, transitioning DRL from laboratory simulations to bespoke cyber environments can introduce numerous issues. This is further exacerbated by the often adversarial, non-stationary, and partially-observable nature of most cybersecurity tasks. In this paper, we identify and systematize 11 methodological pitfalls that frequently occur in DRL for cybersecurity (DRL4Sec) literature across the stages of environment modeling, agent training, performance evaluation, and system deployment. By analyzing 66 significant DRL4Sec papers (2018-2025), we quantify the prevalence of each pitfall and find an average of over five pitfalls per paper. We demonstrate the practical impact of these pitfalls using controlled experiments in (i) autonomous cyber defense, (ii) adversarial malware creation, and (iii) web security testing environments. Finally, we provide actionable recommendations for each pitfall to support the development of more rigorous and deployable DRL-based security systems.

LGMar 5, 2025
Less is more? Rewards in RL for Cyber Defence

Elizabeth Bates, Chris Hicks, Vasilios Mavroudis

The last few years have seen an explosion of interest in autonomous cyber defence agents based on deep reinforcement learning. Such agents are typically trained in a cyber gym environment, also known as a cyber simulator, at least 32 of which have already been built. Most, if not all cyber gyms provide dense "scaffolded" reward functions which combine many penalties or incentives for a range of (un)desirable states and costly actions. Whilst dense rewards help alleviate the challenge of exploring complex environments, yielding seemingly effective strategies from relatively few environment steps; they are also known to bias the solutions an agent can find, potentially towards suboptimal solutions. This is especially a problem in complex cyber environments where policy weaknesses may not be noticed until exploited by an adversary. In this work we set out to evaluate whether sparse reward functions might enable training more effective cyber defence agents. Towards this goal we first break down several evaluation limitations in existing work by proposing a ground truth evaluation score that goes beyond the standard RL paradigm used to train and evaluate agents. By adapting a well-established cyber gym to accommodate our methodology and ground truth score, we propose and evaluate two sparse reward mechanisms and compare them with a typical dense reward. Our evaluation considers a range of network sizes, from 2 to 50 nodes, and both reactive and proactive defensive actions. Our results show that sparse rewards, particularly positive reinforcement for an uncompromised network state, enable the training of more effective cyber defence agents. Furthermore, we show that sparse rewards provide more stable training than dense rewards, and that both effectiveness and training stability are robust to a variety of cyber environment considerations.