CYMar 29, 2022
A Principles-based Ethics Assurance Argument Pattern for AI and Autonomous SystemsZoe Porter, Ibrahim Habli, John McDermid et al.
An assurance case is a structured argument, typically produced by safety engineers, to communicate confidence that a critical or complex system, such as an aircraft, will be acceptably safe within its intended context. Assurance cases often inform third party approval of a system. One emerging proposition within the trustworthy AI and autonomous systems (AI/AS) research community is to use assurance cases to instil justified confidence that specific AI/AS will be ethically acceptable when operational in well-defined contexts. This paper substantially develops the proposition and makes it concrete. It brings together the assurance case methodology with a set of ethical principles to structure a principles-based ethics assurance argument pattern. The principles are justice, beneficence, non-maleficence, and respect for human autonomy, with the principle of transparency playing a supporting role. The argument pattern, shortened to the acronym PRAISE, is described. The objective of the proposed PRAISE argument pattern is to provide a reusable template for individual ethics assurance cases, by which engineers, developers, operators, or regulators could justify, communicate, or challenge a claim about the overall ethical acceptability of the use of a specific AI/AS in a given socio-technical context. We apply the pattern to the hypothetical use case of an autonomous robo-taxi service in a city centre.
AIAug 4, 2023
Unravelling Responsibility for AIZoe Porter, Philippa Ryan, Phillip Morgan et al.
It is widely acknowledged that we need to establish where responsibility lies for the outputs and impacts of AI-enabled systems. This is important to achieve justice and compensation for victims of AI harms, and to inform policy and engineering practice. But without a clear, thorough understanding of what "responsibility" means, deliberations about where responsibility lies will be, at best, unfocused and incomplete and, at worst, misguided. Furthermore, AI-enabled systems exist within a wider ecosystem of actors, decisions, and governance structures, giving rise to complex networks of responsibility relations. To address these issues, this paper presents a conceptual framework of responsibility, accompanied with a graphical notation and general methodology for visualising these responsibility networks and for tracing different responsibility attributions for AI. Taking the three-part formulation "Actor A is responsible for Occurrence O," the framework unravels the concept of responsibility to clarify that there are different possibilities of who is responsible for AI, senses in which they are responsible, and aspects of events they are responsible for. The notation allows these permutations to be represented graphically. The methodology enables users to apply the framework to specific scenarios. The aim is to offer a foundation to support stakeholders from diverse disciplinary backgrounds to discuss and address complex responsibility questions in hypothesised and real-world cases involving AI. The work is illustrated by application to a fictitious scenario of a fatal collision between a crewless, AI-enabled maritime vessel in autonomous mode and a traditional, crewed vessel at sea.
ROFeb 20, 2023
AERoS: Assurance of Emergent Behaviour in Autonomous Robotic SwarmsDhaminda B. Abeywickrama, James Wilson, Suet Lee et al.
The behaviours of a swarm are not explicitly engineered. Instead, they are an emergent consequence of the interactions of individual agents with each other and their environment. This emergent functionality poses a challenge to safety assurance. The main contribution of this paper is a process for the safety assurance of emergent behaviour in autonomous robotic swarms called AERoS, following the guidance on the Assurance of Machine Learning for use in Autonomous Systems (AMLAS). We explore our proposed process using a case study centred on a robot swarm operating a public cloakroom.
LGSep 1, 2022
Review of the AMLAS Methodology for Application in HealthcareShakir Laher, Carla Brackstone, Sara Reis et al.
In recent years, the number of machine learning (ML) technologies gaining regulatory approval for healthcare has increased significantly allowing them to be placed on the market. However, the regulatory frameworks applied to them were originally devised for traditional software, which has largely rule-based behaviour, compared to the data-driven and learnt behaviour of ML. As the frameworks are in the process of reformation, there is a need to proactively assure the safety of ML to prevent patient safety being compromised. The Assurance of Machine Learning for use in Autonomous Systems (AMLAS) methodology was developed by the Assuring Autonomy International Programme based on well-established concepts in system safety. This review has appraised the methodology by consulting ML manufacturers to understand if it converges or diverges from their current safety assurance practices, whether there are gaps and limitations in its structure and if it is fit for purpose when applied to the healthcare domain. Through this work we offer the view that there is clear utility for AMLAS as a safety assurance methodology when applied to healthcare machine learning technologies, although development of healthcare specific supplementary guidance would benefit those implementing the methodology.
CVFeb 13
A Calibrated Memorization Index (MI) for Detecting Training Data Leakage in Generative MRI ModelsYash Deo, Yan Jia, Toni Lassila et al.
Image generative models are known to duplicate images from the training data as part of their outputs, which can lead to privacy concerns when used for medical image generation. We propose a calibrated per-sample metric for detecting memorization and duplication of training data. Our metric uses image features extracted using an MRI foundation model, aggregates multi-layer whitened nearest-neighbor similarities, and maps them to a bounded \emph{Overfit/Novelty Index} (ONI) and \emph{Memorization Index} (MI) scores. Across three MRI datasets with controlled duplication percentages and typical image augmentations, our metric robustly detects duplication and provides more consistent metric values across datasets. At the sample level, our metric achieves near-perfect detection of duplicates.
CLDec 17, 2025
Evaluating Metrics for Safety with LLM-as-JudgesKester Clegg, Richard Hawkins, Ibrahim Habli et al.
LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
CYDec 30, 2023
What's my role? Modelling responsibility for AI-based safety-critical systemsPhilippa Ryan, Zoe Porter, Joanna Al-Qaddoumi et al.
AI-Based Safety-Critical Systems (AI-SCS) are being increasingly deployed in the real world. These can pose a risk of harm to people and the environment. Reducing that risk is an overarching priority during development and operation. As more AI-SCS become autonomous, a layer of risk management via human intervention has been removed. Following an accident it will be important to identify causal contributions and the different responsible actors behind those to learn from mistakes and prevent similar future events. Many authors have commented on the "responsibility gap" where it is difficult for developers and manufacturers to be held responsible for harmful behaviour of an AI-SCS. This is due to the complex development cycle for AI, uncertainty in AI performance, and dynamic operating environment. A human operator can become a "liability sink" absorbing blame for the consequences of AI-SCS outputs they weren't responsible for creating, and may not have understanding of. This cross-disciplinary paper considers different senses of responsibility (role, moral, legal and causal), and how they apply in the context of AI-SCS safety. We use a core concept (Actor(A) is responsible for Occurrence(O)) to create role responsibility models, producing a practical method to capture responsibility relationships and provide clarity on the previously identified responsibility issues. Our paper demonstrates the approach with two examples: a retrospective analysis of the Tempe Arizona fatal collision involving an autonomous vehicle, and a safety focused predictive role-responsibility analysis for an AI-based diabetes co-morbidity predictor. In both examples our primary focus is on safety, aiming to reduce unfair or disproportionate blame being placed on operators or developers. We present a discussion and avenues for future research.
IVMay 12, 2025
Metrics that matter: Evaluating image quality metrics for medical image generationYash Deo, Yan Jia, Toni Lassila et al.
Evaluating generative models for synthetic medical imaging is crucial yet challenging, especially given the high standards of fidelity, anatomical accuracy, and safety required for clinical applications. Standard evaluation of generated images often relies on no-reference image quality metrics when ground truth images are unavailable, but their reliability in this complex domain is not well established. This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data, including tumour and vascular images, providing a representative exemplar for the field. We systematically evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, localised morphological alterations designed to mimic clinically relevant inaccuracies. We then compare these metric scores against model performance on a relevant downstream segmentation task, analysing results across both controlled image perturbations and outputs from different generative model architectures. Our findings reveal significant limitations: many widely-used no-reference image quality metrics correlate poorly with downstream task suitability and exhibit a profound insensitivity to localised anatomical details crucial for clinical validity. Furthermore, these metrics can yield misleading scores regarding distribution shifts, e.g. data memorisation. This reveals the risk of misjudging model readiness, potentially leading to the deployment of flawed tools that could compromise patient safety. We conclude that ensuring generative models are truly fit for clinical purpose requires a multifaceted validation framework, integrating performance on relevant downstream tasks with the cautious interpretation of carefully selected no-reference image quality metrics.
CYDec 9, 2024
Upstream and Downstream AI Safety: Both on the Same River?John McDermid, Yan Jia, Ibrahim Habli
Traditional safety engineering assesses systems in their context of use, e.g. the operational design domain (road layout, speed limits, weather, etc.) for self-driving vehicles (including those using AI). We refer to this as downstream safety. In contrast, work on safety of frontier AI, e.g. large language models which can be further trained for downstream tasks, typically considers factors that are beyond specific application contexts, such as the ability of the model to evade human control, or to produce harmful content, e.g. how to make bombs. We refer to this as upstream safety. We outline the characteristics of both upstream and downstream safety frameworks then explore the extent to which the broad AI safety community can benefit from synergies between these frameworks. For example, can concepts such as common mode failures from downstream safety be used to help assess the strength of AI guardrails? Further, can the understanding of the capabilities and limitations of frontier AI be used to inform downstream safety analysis, e.g. where LLMs are fine-tuned to calculate voyage plans for autonomous vessels? The paper identifies some promising avenues to explore and outlines some challenges in achieving synergy, or a confluence, between upstream and downstream safety frameworks.
AIMar 24, 2025
The case for delegated AI autonomy for Human AI teaming in healthcareYan Jia, Harriet Evans, Zoe Porter et al.
In this paper we propose an advanced approach to integrating artificial intelligence (AI) into healthcare: autonomous decision support. This approach allows the AI algorithm to act autonomously for a subset of patient cases whilst serving a supportive role in other subsets of patient cases based on defined delegation criteria. By leveraging the complementary strengths of both humans and AI, it aims to deliver greater overall performance than existing human-AI teaming models. It ensures safe handling of patient cases and potentially reduces clinician review time, whilst being mindful of AI tool limitations. After setting the approach within the context of current human-AI teaming models, we outline the delegation criteria and apply them to a specific AI-based tool used in histopathology. The potential impact of the approach and the regulatory requirements for its successful implementation are then discussed.
CLNov 20, 2025
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing DialogueZachary Ellis, Jared Joselowitz, Yash Deo et al.
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
AIOct 24, 2025
Out-of-Distribution Detection for Safety Assurance of AI and Autonomous SystemsVictoria J. Hodge, Colin Paterson, Ibrahim Habli
The operational capabilities and application domains of AI-enabled autonomous systems have expanded significantly in recent years due to advances in robotics and machine learning (ML). Demonstrating the safety of autonomous systems rigorously is critical for their responsible adoption but it is challenging as it requires robust methodologies that can handle novel and uncertain situations throughout the system lifecycle, including detecting out-of-distribution (OoD) data. Thus, OOD detection is receiving increased attention from the research, development and safety engineering communities. This comprehensive review analyses OOD detection techniques within the context of safety assurance for autonomous systems, in particular in safety-critical domains. We begin by defining the relevant concepts, investigating what causes OOD and exploring the factors which make the safety assurance of autonomous systems and OOD detection challenging. Our review identifies a range of techniques which can be used throughout the ML development lifecycle and we suggest areas within the lifecycle in which they may be used to support safety assurance arguments. We discuss a number of caveats that system and safety engineers must be aware of when integrating OOD detection into system lifecycles. We conclude by outlining the challenges and future work necessary for the safe development and operation of autonomous systems across a range of domains and applications.
AIAug 26, 2025
MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluationErnest Lim, Yajie Vera He, Jared Joselowitz et al.
Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario-conditioned responses, evaluated for realism and behavioral fidelity with human factors expertise, and a patient-preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets.
LGJun 23, 2024
Learning Run-time Safety Monitors for Machine Learning ComponentsOzan Vardal, Richard Hawkins, Colin Paterson et al.
For machine learning components used as part of autonomous systems (AS) in carrying out critical tasks it is crucial that assurance of the models can be maintained in the face of post-deployment changes (such as changes in the operating environment of the system). A critical part of this is to be able to monitor when the performance of the model at runtime (as a result of changes) poses a safety risk to the system. This is a particularly difficult challenge when ground truth is unavailable at runtime. In this paper we introduce a process for creating safety monitors for ML components through the use of degraded datasets and machine learning. The safety monitor that is created is deployed to the AS in parallel to the ML component to provide a prediction of the safety risk associated with the model output. We demonstrate the viability of our approach through some initial experiments using publicly available speed sign datasets.
LGSep 1, 2021
The Role of Explainability in Assuring Safety of Machine Learning in HealthcareYan Jia, John McDermid, Tom Lawton et al.
Established approaches to assuring safety-critical systems and software are difficult to apply to systems employing ML where there is no clear, pre-defined specification against which to assess validity. This problem is exacerbated by the "opaque" nature of ML where the learnt model is not amenable to human scrutiny. Explainable AI (XAI) methods have been proposed to tackle this issue by producing human-interpretable representations of ML models which can help users to gain confidence and build trust in the ML system. However, little work explicitly investigates the role of explainability for safety assurance in the context of ML development. This paper identifies ways in which XAI methods can contribute to safety assurance of ML-based systems. It then uses a concrete ML-based clinical decision support system, concerning weaning of patients from mechanical ventilation, to demonstrate how XAI methods can be employed to produce evidence to support safety assurance. The results are also represented in a safety argument to show where, and in what way, XAI methods can contribute to a safety case. Overall, we conclude that XAI methods have a valuable role in safety assurance of ML-based systems in healthcare but that they are not sufficient in themselves to assure safety.
LGFeb 2, 2021
Guidance on the Assurance of Machine Learning in Autonomous Systems (AMLAS)Richard Hawkins, Colin Paterson, Chiara Picardi et al.
Machine Learning (ML) is now used in a range of systems with results that are reported to exceed, under certain conditions, human performance. Many of these systems, in domains such as healthcare , automotive and manufacturing, exhibit high degrees of autonomy and are safety critical. Establishing justified confidence in ML forms a core part of the safety case for these systems. In this document we introduce a methodology for the Assurance of Machine Learning for use in Autonomous Systems (AMLAS). AMLAS comprises a set of safety case patterns and a process for (1) systematically integrating safety assurance into the development of ML components and (2) for generating the evidence base for explicitly justifying the acceptable safety of these components when integrated into autonomous system applications.
LGJan 11, 2021
A Framework for Assurance of Medication Safety using Machine LearningYan Jia, Tom Lawton, John McDermid et al.
Medication errors continue to be the leading cause of avoidable patient harm in hospitals. This paper sets out a framework to assure medication safety that combines machine learning and safety engineering methods. It uses safety analysis to proactively identify potential causes of medication error, based on expert opinion. As healthcare is now data rich, it is possible to augment safety analysis with machine learning to discover actual causes of medication error from the data, and to identify where they deviate from what was predicted in the safety analysis. Combining these two views has the potential to enable the risk of medication errors to be managed proactively and dynamically. We apply the framework to a case study involving thoracic surgery, e.g. oesophagectomy, where errors in giving beta-blockers can be critical to control atrial fibrillation. This case study combines a HAZOP-based safety analysis method known as SHARD with Bayesian network structure learning and process mining to produce the analysis results, showing the potential of the framework for ensuring patient safety, and for transforming the way that safety is managed in complex healthcare environments.
SEMar 18, 2017
Engineering Trustworthy Self-Adaptive Software with Dynamic Assurance CasesRadu Calinescu, Danny Weyns, Simos Gerasimou et al.
Building on concepts drawn from control theory, self-adaptive software handles environmental and internal uncertainties by dynamically adjusting its architecture and parameters in response to events such as workload changes and component failures. Self-adaptive software is increasingly expected to meet strict functional and non-functional requirements in applications from areas as diverse as manufacturing, healthcare and finance. To address this need, we introduce a methodology for the systematic ENgineering of TRUstworthy Self-adaptive sofTware (ENTRUST). ENTRUST uses a combination of (1) design-time and runtime modelling and verification, and (2) industry-adopted assurance processes to develop trustworthy self-adaptive software and assurance cases arguing the suitability of the software for its intended application. To evaluate the effectiveness of our methodology, we present a tool-supported instance of ENTRUST and its use to develop proof-of-concept self-adaptive software for embedded and service-based systems from the oceanic monitoring and e-finance domains, respectively. The experimental results show that ENTRUST can be used to engineer self-adaptive software systems in different application domains and to generate dynamic assurance cases for these systems.
SEApr 27, 2014
Formalism of Requirements for Safety-Critical Software: Where Does the Benefit Come From?Ibrahim Habli, Andrew Rae
Safety and assurance standards often rely on the principle that requirements errors can be minimised by expressing the requirements more formally. Although numerous case studies have shown that the act of formalising previously informal requirements finds requirements errors, this principle is really just a hypothesis. An industrially persuasive causal relationship between formalisation and better requirements has yet to be established. We describe multiple competing explanations for this hypothesis, in terms of the levels of precision, re-formulation, expertise, effort and automation that are typically associated with formalising requirements. We then propose an experiment to distinguish between these explanations, without necessarily excluding the possibility that none of them are correct.