Simos Gerasimou

SE
h-index: 34
18 papers
434 citations
Novelty: 47%
AI Score: 53

18 Papers

RO Mar 15, 2023
Bayesian Learning for the Robust Verification of Autonomous Robots

Xingyu Zhao, Simos Gerasimou, Radu Calinescu et al.

Autonomous robots used in infrastructure inspection, space exploration and other critical missions operate in highly dynamic environments. As such, they must continually verify their ability to complete the tasks associated with these missions safely and effectively. Here we present a Bayesian learning framework that enables this runtime verification of autonomous robots. The framework uses prior knowledge and observations of the verified robot to learn expected ranges for the occurrence rates of regular and singular (e.g., catastrophic failure) events. Interval continuous-time Markov models defined using these ranges are then analysed to obtain expected intervals of variation for system properties such as mission duration and success probability. We apply the framework to an autonomous robotic mission for underwater infrastructure inspection and repair. The formal proofs and experiments presented in the paper show that our framework produces results that reflect the uncertainty intrinsic to many real-world systems, enabling the robust verification of their quantitative properties under parametric uncertainty.
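
As a rough illustration of learning a range for an event rate from prior knowledge and observations, the sketch below applies a conjugate Gamma-Poisson update to a set of candidate priors and reports the spread of the posterior means. This is a simplified stand-in and not the estimators used in the paper; the function names, the prior encoding and the example numbers are all hypothetical.

```python
from typing import Iterable, Tuple

Prior = Tuple[float, float]  # (shape, rate) of a Gamma prior; hypothetical encoding


def gamma_posterior(prior: Prior, events: int, exposure: float) -> Prior:
    """Conjugate update: Gamma(shape, rate) prior plus a Poisson count over `exposure`."""
    shape, rate = prior
    return shape + events, rate + exposure


def rate_range(priors: Iterable[Prior], events: int, exposure: float) -> Tuple[float, float]:
    """Lowest and highest posterior-mean rate across a set of candidate priors."""
    means = []
    for prior in priors:
        shape, rate = gamma_posterior(prior, events, exposure)
        means.append(shape / rate)  # mean of a Gamma(shape, rate) distribution
    return min(means), max(means)


if __name__ == "__main__":
    # Assumed example: 3 events observed in 90 time units, two candidate priors.
    low, high = rate_range([(1.0, 10.0), (2.0, 10.0)], events=3, exposure=90.0)
    print(f"rate in [{low:.3f}, {high:.3f}]")
```

A range of this kind could then serve as the bounds of one transition rate in an interval continuous-time Markov model, which is the role the abstract assigns to the learned ranges.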

LG Aug 18, 2023
Robust Uncertainty Quantification Using Conformalised Monte Carlo Prediction

Daniel Bethell, Simos Gerasimou, Radu Calinescu

Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite their advances, state-of-the-art UQ methods are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over advanced UQ methods, like MC dropout, RAPS and CQR, both in classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple.
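
To make the conformal prediction (CP) step concrete, here is a minimal sketch of standard split conformal prediction for regression intervals. It covers only the generic CP stage and assumes an arbitrary point predictor; it does not reproduce the adaptive MC dropout component of MC-CP, and the function names and toy data are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    # 1-indexed rank ceil((n + 1) * (1 - alpha)) among the sorted scores.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for the requested coverage
    return sorted(scores)[rank - 1]


def split_conformal_interval(
    predict: Callable[[float], float],
    calibration: List[Tuple[float, float]],
    x_new: float,
    alpha: float = 0.1,
) -> Tuple[float, float]:
    """Prediction interval for x_new with target coverage 1 - alpha."""
    # Nonconformity score: absolute residual on held-out calibration pairs.
    scores = [abs(y - predict(x)) for x, y in calibration]
    q = conformal_quantile(scores, alpha)
    centre = predict(x_new)
    return centre - q, centre + q


if __name__ == "__main__":
    model = lambda x: 2 * x  # stand-in for a trained regressor
    calib = [(i, 2 * i + i) for i in range(1, 10)]  # toy residuals 1..9
    print(split_conformal_interval(model, calib, x_new=10, alpha=0.5))
```

The width of the interval is driven entirely by the calibration residuals, so any model that yields point predictions can be wrapped this way without retraining.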

14.1 AI Apr 22
Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs

Gricel Vázquez, Alexandros Evangelidis, Sepeedeh Shahbeigi et al.

Integrating Large Language Models (LLMs) into complex software systems enables the generation of human-understandable explanations of opaque AI processes, such as automated task planning. However, the quality and reliability of these explanations heavily depend on effective prompt engineering. The lack of a systematic understanding of how diverse stakeholder groups formulate and refine prompts hinders the development of tools that can automate this process. We introduce COMPASS (COgnitive Modelling for Prompt Automated SynthesiS), a proof-of-concept self-adaptive approach that formalises prompt engineering as a cognitive and probabilistic decision-making process. COMPASS models users' unobservable latent cognitive states, such as attention and comprehension, together with uncertainty and observable interaction cues, as a POMDP, whose synthesised policy enables adaptive generation of explanations and prompt refinements. We evaluate COMPASS using two diverse cyber-physical system case studies to assess adaptive explanation generation and the quality of the resulting explanations, both quantitatively and qualitatively. Our results demonstrate the feasibility of using COMPASS to integrate human cognition and user-profile feedback into automated prompt synthesis in complex task planning systems.

9.6 LO Apr 8
Formally Guaranteed Control Adaptation for ODD-Resilient Autonomous Systems

Gricel Vázquez, Calum Imrie, Sepeedeh Shahbeigi et al.

Ensuring reliable performance in situations outside the Operational Design Domain (ODD) remains a primary challenge in devising resilient autonomous systems. We explore this challenge by introducing an approach for adapting probabilistic system models to handle out-of-ODD scenarios while, in parallel, providing quantitative guarantees. Our approach dynamically extends the coverage of existing system situation capabilities, supporting the verification and adaptation of the system's behaviour under unanticipated situations. Preliminary results demonstrate that our approach effectively increases system reliability by adapting its behaviour and providing formal guarantees even under unforeseen out-of-ODD situations.

LG Nov 30, 2025
Uncertainty Quantification for Deep Regression using Contextualised Normalizing Flows

Adriel Sosa Marco, John Daniel Kirwan, Alexia Toumpa et al.

Quantifying uncertainty in deep regression models is important both for understanding the confidence of the model and for safe decision-making in high-risk domains. Existing approaches that yield prediction intervals overlook distributional information, neglecting the effect of multimodal or asymmetric distributions on decision-making. Similarly, full or approximated Bayesian methods, while yielding the predictive posterior density, demand major modifications to the model architecture and retraining. We introduce MCNF, a novel post hoc uncertainty quantification method that produces both prediction intervals and the full conditioned predictive distribution. MCNF operates on top of the underlying trained predictive model; thus, no predictive model retraining is needed. We provide experimental evidence that the MCNF-based uncertainty estimate is well calibrated, is competitive with state-of-the-art uncertainty quantification methods, and provides richer information for downstream decision-making tasks.

LG Mar 25, 2024
DeepKnowledge: Generalisation-Driven Deep Learning Testing

Sondess Missaoui, Simos Gerasimou, Nikolaos Matragkas

Despite their unprecedented success, DNNs are notoriously fragile to small shifts in data distribution, demanding effective testing techniques that can assess their dependability. Although DNN testing has advanced recently, there is a lack of systematic testing approaches that assess the DNN's capability to generalise and operate comparably beyond data in their training distribution. We address this gap with DeepKnowledge, a systematic testing methodology for DNN-based systems founded on the theory of knowledge generalisation, which aims to enhance DNN robustness and reduce the residual risk of 'black box' models. Conforming to this theory, DeepKnowledge posits that core computational DNN units, termed Transfer Knowledge neurons, can generalise under domain shift. DeepKnowledge provides an objective confidence measurement on DNN testing activities under data distribution shifts and uses this information to instrument a generalisation-informed test adequacy criterion to check the transfer knowledge capacity of a test set. Our empirical evaluation of several DNNs, across multiple datasets and state-of-the-art adversarial generation techniques, demonstrates the usefulness and effectiveness of DeepKnowledge and its ability to support the engineering of more dependable DNNs. We report improvements of up to 10 percentage points over state-of-the-art coverage criteria for detecting adversarial attacks on several benchmarks, including MNIST, SVHN, and CIFAR.

LG Oct 21, 2025
Learning to Navigate Under Imperfect Perception: Conformalised Segmentation for Safe Reinforcement Learning

Daniel Bethell, Simos Gerasimou, Radu Calinescu et al.

Reliable navigation in safety-critical environments requires both accurate hazard perception and principled uncertainty handling to strengthen downstream safety handling. Despite their effectiveness, existing approaches assume perfect hazard detection capabilities, while uncertainty-aware perception approaches lack finite-sample guarantees. We present COPPOL, a conformal-driven perception-to-policy learning approach that integrates distribution-free, finite-sample safety guarantees into semantic segmentation, yielding calibrated hazard maps with rigorous bounds for missed detections. These maps induce risk-aware cost fields for downstream RL planning. Across two satellite-derived benchmarks, COPPOL increases hazard coverage (up to 6x) compared to baselines, achieving near-complete detection of unsafe regions while reducing hazardous violations during navigation (up to approximately 50%). More importantly, our approach remains robust to distributional shift, preserving both safety and efficiency.

LG Oct 21, 2025
Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation

Daniel Bethell, Simos Gerasimou, Radu Calinescu et al.

Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.

LG Sep 29, 2025
Guided Uncertainty Learning Using a Post-Hoc Evidential Meta-Model

Charmaine Barker, Daniel Bethell, Simos Gerasimou

Reliable uncertainty quantification remains a major obstacle to the deployment of deep learning models under distributional shift. Existing post-hoc approaches that retrofit pretrained models either inherit misplaced confidence or merely reshape predictions, without teaching the model when to be uncertain. We introduce GUIDE, a lightweight evidential learning meta-model approach that attaches to a frozen deep learning model and explicitly learns how and when to be uncertain. GUIDE identifies salient internal features via a calibration stage, and then employs these features to construct a noise-driven curriculum that teaches the model how and when to express uncertainty. GUIDE requires no retraining, no architectural modifications, and no manual intermediate-layer selection to the base deep learning model, thus ensuring broad applicability and minimal user intervention. The resulting model avoids distilling overconfidence from the base model, improves out-of-distribution detection by ~77% and adversarial attack detection by ~80%, while preserving in-distribution performance. Across diverse benchmarks, GUIDE consistently outperforms state-of-the-art approaches, evidencing the need for actively guiding uncertainty to close the gap between predictive confidence and reliability.

AI Jun 21, 2025
Efficient Strategy Synthesis for MDPs via Hierarchical Block Decomposition

Alexandros Evangelidis, Gricel Vázquez, Simos Gerasimou

Software-intensive systems, such as software product lines and robotics, utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Despite the usefulness of conventional policy synthesis methods, they fail to scale to large state spaces. Our approach addresses this issue and accelerates policy synthesis in large MDPs by dynamically refining the MDP and iteratively selecting the most fragile MDP regions for refinement. This iterative procedure offers a balance between accuracy and efficiency, as refinement occurs only when necessary. Through a comprehensive empirical evaluation comprising diverse case studies and MDPs with up to 1M states, we demonstrate significant performance improvements yielded by our approach compared to the leading probabilistic model checker PRISM (up to 2x), thus offering a very competitive solution for real-world policy synthesis tasks in larger MDPs.
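
For context, the kind of conventional policy synthesis that the abstract says struggles at scale can be sketched as plain value iteration for maximum reachability probability. This is the textbook baseline only, not the hierarchical block decomposition proposed in the paper; the MDP encoding, function name and toy model are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# transitions[state][action] = list of (probability, next_state); hypothetical encoding
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]


def value_iteration(
    transitions: MDP, goal: Set[str], epsilon: float = 1e-8, max_iters: int = 10_000
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Maximum probability of reaching `goal` from each state, plus a greedy policy."""
    values = {s: (1.0 if s in goal else 0.0) for s in transitions}

    def action_value(outcomes: List[Tuple[float, str]]) -> float:
        return sum(p * values[t] for p, t in outcomes)

    for _ in range(max_iters):
        delta = 0.0
        for state, actions in transitions.items():
            if state in goal or not actions:
                continue  # goal and absorbing states keep their fixed value
            best = max(action_value(outcomes) for outcomes in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < epsilon:
            break

    # Extract the action that attains the best value in each non-terminal state.
    policy = {
        state: max(actions, key=lambda a: action_value(actions[a]))
        for state, actions in transitions.items()
        if actions and state not in goal
    }
    return values, policy


if __name__ == "__main__":
    toy: MDP = {
        "s0": {"safe": [(0.9, "done"), (0.1, "fail")],
               "risky": [(0.5, "done"), (0.5, "fail")]},
        "done": {},
        "fail": {},
    }
    print(value_iteration(toy, goal={"done"}))
```

Every sweep touches every state, which is why the cost grows with the size of the state space and why restricting work to selected regions, as the abstract describes, can pay off.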

SE Feb 2, 2021
Fast Parametric Model Checking through Model Fragmentation

Xinwei Fang, Radu Calinescu, Simos Gerasimou et al.

Parametric model checking (PMC) computes algebraic formulae that express key non-functional properties of a system (reliability, performance, etc.) as rational functions of the system and environment parameters. In software engineering, PMC formulae can be used during design, e.g., to analyse the sensitivity of different system architectures to parametric variability, or to find optimal system configurations. They can also be used at runtime, e.g., to check if non-functional requirements are still satisfied after environmental changes, or to select new configurations after such changes. However, current PMC techniques do not scale well to systems with complex behaviour and more than a few parameters. Our paper introduces a fast PMC (fPMC) approach that overcomes this limitation, extending the applicability of PMC to a broader class of systems than previously possible. To this end, fPMC partitions the Markov models that PMC operates with into fragments whose reachability properties are analysed independently, and obtains PMC reachability formulae by combining the results of these fragment analyses. To demonstrate the effectiveness of fPMC, we show how our fPMC tool can analyse three systems (taken from the research literature, and belonging to different application domains) with which current PMC techniques and tools struggle.

SE Aug 10, 2020
Learning to Learn in Collective Adaptive Systems: Mining Design Patterns for Data-driven Reasoning

Mirko D'Angelo, Sona Ghahremani, Simos Gerasimou et al.

Engineering collective adaptive systems (CAS) with learning capabilities is a challenging task due to their multi-dimensional and complex design space. Data-driven approaches for CAS design could introduce new insights enabling system engineers to manage the CAS complexity more cost-effectively at the design phase. This paper introduces a systematic approach to reason about design choices and patterns of learning-based CAS. Using data from a systematic literature review, reasoning is performed with a novel application of data-driven methodologies such as clustering, multiple correspondence analysis and decision trees. We demonstrate both reasoning based on past experience and support for novel and innovative design choices.

SE Aug 5, 2020
Supporting Robotic Software Migration Using Static Analysis and Model-Driven Engineering

Sophie Wood, Nicholas Matragkas, Dimitris Kolovos et al.

The wide use of robotic systems has contributed to the development of robotic software that is highly coupled to the hardware platform running the robotic system. Due to increased maintenance cost or changing business priorities, the robotic hardware is infrequently upgraded, thus increasing the risk of technology stagnation. Reducing this risk entails migrating the system and its software to a new hardware platform. Conventional software engineering practices such as complete re-development and code-based migration, albeit useful in mitigating these obsolescence issues, are time-consuming and overly expensive. Our RoboSMi model-driven approach supports the migration of the software controlling a robotic system between hardware platforms. First, RoboSMi executes static analysis on the robotic software of the source hardware platform to identify platform-dependent and platform-agnostic software constructs. By analysing a model that expresses the architecture of robotic components on the target platform, RoboSMi establishes the hardware configuration of those components and suggests software libraries for each component whose execution will enable the robotic software to control the components. Finally, RoboSMi uses code generation to produce software for the target platform and indicates areas that require manual intervention by robotic engineers to complete the migration. We evaluate the applicability of RoboSMi and analyse the level of automation and performance provided by its use by migrating two robotic systems, deployed for an environmental monitoring mission and a line-following mission, from a Propeller Activity Board to an Arduino Uno.

SE Jul 31, 2020
Genetic Improvement @ ICSE 2020

William B. Langdon, Westley Weimer, Justyna Petke et al.

Following the keynote by Prof. Mark Harman of Facebook and the formal presentations (which are recorded in the proceedings), there was a wide-ranging discussion at the eighth international Genetic Improvement workshop, GI-2020 @ ICSE (held as part of the 42nd ACM/IEEE International Conference on Software Engineering on Friday 3rd July 2020). Topics included industry take-up, human factors, explainability (explainability, justifiability, exploitability) and GI benchmarks. We also contrast various recent online approaches (e.g. SBST 2020) to holding virtual computer science conferences and workshops over the Internet without face-to-face interaction. Finally, we speculate on how the coronavirus (Covid-19) pandemic will affect research next year and into the future.

SE Feb 9, 2020
Importance-Driven Deep Learning System Testing

Simos Gerasimou, Hasan Ferit Eniser, Alper Sen et al.

Deep Learning (DL) systems are key enablers for engineering intelligent applications due to their ability to solve complex tasks such as image recognition and machine translation. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. Recent research in this direction focuses on adapting testing criteria from traditional software engineering as a means of increasing confidence in their correct behaviour. However, these criteria are inadequate in capturing the intrinsic properties exhibited by these systems. We bridge this gap by introducing DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC makes it possible to establish a layer-wise functional understanding of the importance of DL system components and use this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems, across multiple DL datasets and with state-of-the-art adversarial generation techniques, demonstrates the usefulness and effectiveness of DeepImportance and its ability to support the engineering of more robust DL systems.

SE Apr 2, 2019
Software Engineering for Intelligent and Autonomous Systems: Report from the GI Dagstuhl Seminar 18343

Simos Gerasimou, Thomas Vogel, Ada Diaconescu

Software systems are increasingly used in application domains characterised by uncertain environments, evolving requirements and unexpected failures; sudden system malfunctioning raises serious issues of security, safety, loss of comfort or revenue. During operation, these systems will likely need to deal with several unpredictable situations including variations in system performance, sudden changes in system workload and component failures. These situations can cause deviation from the desired system behaviour and require dynamic adaptation of the system behaviour, parameters or architecture. By using closed-loop control, typically realized with software, intelligent and autonomous software systems can dynamically adapt themselves, without any or with limited human involvement, by identifying abnormal situations, analysing alternative adaptation options, and finally, self-adapting to a suitable new configuration. This report summarises the research carried out during the SEfIAS GI Dagstuhl seminar, which provided a forum for strengthening interaction and collaboration for early-career researchers and practitioners from the research communities of SEAMS, ICAC/ICCAC, SASO, Self-Aware Computing and AAMAS.

SE Feb 15, 2019
DeepFault: Fault Localization for Deep Neural Networks

Hasan Ferit Eniser, Simos Gerasimou, Alper Sen

Deep Neural Networks (DNNs) are increasingly deployed in safety-critical applications including autonomous vehicles and medical diagnostics. To reduce the residual risk for unexpected DNN behaviour and provide evidence for their trustworthy operation, DNNs should be thoroughly tested. The DeepFault whitebox DNN testing approach presented in our paper addresses this challenge by employing suspiciousness measures inspired by fault localization to establish the hit spectrum of neurons and identify suspicious neurons whose weights have not been calibrated correctly and thus are considered responsible for inadequate DNN performance. DeepFault also uses a suspiciousness-guided algorithm to synthesize new inputs, from correctly classified inputs, that increase the activation values of suspicious neurons. Our empirical evaluation on several DNN instances trained on MNIST and CIFAR-10 datasets shows that DeepFault is effective in identifying suspicious neurons. Also, the inputs synthesized by DeepFault closely resemble the original inputs, exercise the identified suspicious neurons and are highly adversarial.
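
The idea of building a hit spectrum per neuron and scoring suspiciousness can be sketched as follows. The abstract does not name the measures it uses, so this example assumes the Tarantula measure from spectrum-based fault localization; the activation threshold, function names and toy data are likewise hypothetical.

```python
from typing import Dict, List, Sequence


def hit_spectrum(
    activations: Sequence[Sequence[float]], correct: Sequence[bool], threshold: float = 0.0
) -> List[Dict[str, int]]:
    """Per-neuron counts: active/not active ('a'/'n') on successful/failed ('s'/'f') inputs."""
    n_neurons = len(activations[0])
    spectrum = [{"as": 0, "af": 0, "ns": 0, "nf": 0} for _ in range(n_neurons)]
    for acts, ok in zip(activations, correct):
        for j, value in enumerate(acts):
            key = ("a" if value > threshold else "n") + ("s" if ok else "f")
            spectrum[j][key] += 1
    return spectrum


def tarantula(counts: Dict[str, int]) -> float:
    """Tarantula suspiciousness: share of failures relative to failures plus successes."""
    failed = counts["af"] + counts["nf"]
    passed = counts["as"] + counts["ns"]
    fail_ratio = counts["af"] / failed if failed else 0.0
    pass_ratio = counts["as"] / passed if passed else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0


def most_suspicious(
    activations: Sequence[Sequence[float]], correct: Sequence[bool], k: int = 1
) -> List[int]:
    """Indices of the k neurons with the highest suspiciousness score."""
    scores = [tarantula(c) for c in hit_spectrum(activations, correct)]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


if __name__ == "__main__":
    # Toy layer with two neurons; neuron 1 fires only on misclassified inputs.
    acts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    labels_ok = [True, True, False, False]
    print(most_suspicious(acts, labels_ok, k=1))
```

Neurons ranked this way would be the targets whose activation values a suspiciousness-guided input synthesis step tries to increase.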

SE Mar 18, 2017
Engineering Trustworthy Self-Adaptive Software with Dynamic Assurance Cases

Radu Calinescu, Danny Weyns, Simos Gerasimou et al.

Building on concepts drawn from control theory, self-adaptive software handles environmental and internal uncertainties by dynamically adjusting its architecture and parameters in response to events such as workload changes and component failures. Self-adaptive software is increasingly expected to meet strict functional and non-functional requirements in applications from areas as diverse as manufacturing, healthcare and finance. To address this need, we introduce a methodology for the systematic ENgineering of TRUstworthy Self-adaptive sofTware (ENTRUST). ENTRUST uses a combination of (1) design-time and runtime modelling and verification, and (2) industry-adopted assurance processes to develop trustworthy self-adaptive software and assurance cases arguing the suitability of the software for its intended application. To evaluate the effectiveness of our methodology, we present a tool-supported instance of ENTRUST and its use to develop proof-of-concept self-adaptive software for embedded and service-based systems from the oceanic monitoring and e-finance domains, respectively. The experimental results show that ENTRUST can be used to engineer self-adaptive software systems in different application domains and to generate dynamic assurance cases for these systems.