Hassan Sartaj

SE
h-index27
10papers
38citations
Novelty29%
AI Score43

10 Papers

SEMay 6
Software Engineering for Self-Adaptive Robotics: A Research Agenda

Hassan Sartaj, Shaukat Ali, Ana Cavalcanti et al.

Self-adaptive robotic systems operate autonomously in dynamic and uncertain environments, requiring robust real-time monitoring and adaptive behaviour. Unlike traditional robotic software with predefined logic, self-adaptive robots exploit artificial intelligence (AI), machine learning, and model-driven engineering to adapt continuously to changing conditions, thereby ensuring reliability, safety, and optimal performance. This paper presents a research agenda for software engineering in self-adaptive robotics, structured along two dimensions. The first concerns the software engineering lifecycle, requirements, design, development, testing, and operations, tailored to the challenges of self-adaptive robotics. The second focuses on enabling technologies such as digital twins and AI-driven adaptation, which support runtime monitoring, fault detection, and automated decision-making. We identify open challenges, including verifying adaptive behaviours under uncertainty, balancing trade-offs between adaptability, performance, and safety, and integrating self-adaptation frameworks like MAPE K/MAPLE-K. By consolidating these challenges into a roadmap toward 2030, this work contributes to the foundations of trustworthy and efficient self-adaptive robotic systems capable of meeting the complexities of real-world deployment.

ROMay 8
Search-based Robustness Testing of Laptop Refurbishing Robotic Software

Erblin Isaku, Hassan Sartaj, Shaukat Ali et al.

The Danish Technological Institute (DTI) focuses on transferring advanced technologies (including robots) to the industry and the public sector. One key application is laptop refurbishment using specialized robots, aimed at promoting reuse, reducing electronic waste, and supporting the European Circular Economy Action Plan. The software of such robots often includes features that use object detection models to detect objects for various purposes, such as identifying screws for laptop disassembly or detecting stickers to remove them. Ensuring the robustness of such models to small input variations remains a critical challenge, and addressing it is important to avoid potential damage to laptops during refurbishment. In this paper, we propose PROBE, a search-based robustness testing approach that leverages multi-objective optimization to identify minimal, localized perturbations that expose failures in object detection models used in the software of laptop refurbishing robots. PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases. Results show that PROBE is 3$\times$ to 7$\times$ more effective than random search in generating failure-inducing perturbations, while requiring smaller perturbation magnitudes, and that the generated perturbations transfer across models. We further show that metamorphic relations provide additional insights into model robustness, enabling the assessment of stability even in non-failing cases.

ROMay 4
Human-in-the-Loop Uncertainty Analysis in Self-Adaptive Robots Using LLMs

Hassan Sartaj, Jalil Boudjadar, Mirgita Frasheri et al.

Self-adaptive robots operate in dynamic, unpredictable environments where unaddressed uncertainties can lead to safety violations and operational failures. However, systematically identifying and analyzing these uncertainties, including their sources, impacts, and mitigation strategies, remains a significant challenge given the inherent complexity of real-world environments, dynamic robotic behavior, and the rapid evolution of robotic technologies. To address this, we introduce RoboULM, a human-in-the-loop methodology and tool that supports practitioners in systematically exploring uncertainties at the design stage using large language models (LLMs). Moreover, we present an uncertainty taxonomy that provides a detailed catalog of uncertainties in self-adaptive robots. We evaluated RoboULM with 16 practitioners from four industrial use cases. The results show that RoboULM was perceived as both useful and easy to understand, with the participants particularly valuing structured prompting and iterative refinement support. These findings demonstrate the potential of RoboULM as a viable solution for systematic uncertainty analysis in complex robots.

SEMar 23, 2024
Automated System-level Testing of Unmanned Aerial Systems

Hassan Sartaj, Asmar Muqeet, Muhammad Zohaib Iqbal et al.

Unmanned aerial systems (UAS) rely on various avionics systems that are safety-critical and mission-critical. A major requirement of international safety standards is to perform rigorous system-level testing of avionics software systems. The current industrial practice is to manually create test scenarios, manually/automatically execute these scenarios using simulators, and manually evaluate outcomes. The test scenarios typically consist of setting certain flight or environment conditions and testing the system under test in these settings. The state-of-the-art approaches for this purpose also require manual test scenario development and evaluation. In this paper, we propose a novel approach to automate the system-level testing of the UAS. The proposed approach (AITester) utilizes model-based testing and artificial intelligence (AI) techniques to automatically generate, execute, and evaluate various test scenarios. The test scenarios are generated on the fly, i.e., during test execution based on the environmental context at runtime. The approach is supported by a toolset. We empirically evaluate the proposed approach on two core components of UAS, an autopilot system of an unmanned aerial vehicle (UAV) and cockpit display systems (CDS) of the ground control station (GCS). The results show that the AITester effectively generates test scenarios causing deviations from the expected behavior of the UAV autopilot and reveals potential flaws in the GCS-CDS.

ROApr 28, 2025
Digital Twin-based Out-of-Distribution Detection in Autonomous Vessels

Erblin Isaku, Hassan Sartaj, Shaukat Ali

An autonomous vessel (AV) is a complex cyber-physical system (CPS) with software enabling many key functionalities, e.g., navigation software enables an AV to autonomously or semi-autonomously follow a path to its destination. Digital twins of such AVs enable advanced functionalities such as running what-if scenarios, performing predictive maintenance, and enabling fault diagnosis. Due to technological improvements, real-time analyses using continuous data from vessels' real-time operations have become increasingly possible. However, the literature has little explored developing advanced analyses in real-time data in AVs with digital twins built with machine learning techniques. To this end, we present a novel digital twin-based approach (ODDIT) to detect future out-of-distribution (OOD) states of an AV before reaching them, enabling proactive intervention. Such states may indicate anomalies requiring attention (e.g., manual correction by the ship master) and assist testers in scenario-centered testing. The digital twin consists of two machine-learning models predicting future vessel states and whether the predicted state will be OOD. We evaluated ODDIT with five vessels across waypoint and zigzag maneuvering under simulated conditions, including sensor and actuator noise and environmental disturbances i.e., ocean current. ODDIT achieved high accuracy in detecting OOD states, with AUROC and TNR@TPR95 scores reaching 99\% across multiple vessels.

SEFeb 16, 2024
LLMs in the Heart of Differential Testing: A Case Study on a Medical Rule Engine

Erblin Isaku, Christoph Laaber, Hassan Sartaj et al.

The Cancer Registry of Norway (CRN) uses an automated cancer registration support system (CaReSS) to support core cancer registry activities, i.e, data capture, data curation, and producing data products and statistics for various stakeholders. GURI is a core component of CaReSS, which is responsible for validating incoming data with medical rules. Such medical rules are manually implemented by medical experts based on medical standards, regulations, and research. Since large language models (LLMs) have been trained on a large amount of public information, including these documents, they can be employed to generate tests for GURI. Thus, we propose an LLM-based test generation and differential testing approach (LLMeDiff) to test GURI. We experimented with four different LLMs, two medical rule engine implementations, and 58 real medical rules to investigate the hallucination, success, time efficiency, and robustness of the LLMs to generate tests, and these tests' ability to find potential issues in GURI. Our results showed that GPT-3.5 hallucinates the least, is the most successful, and is generally the most robust; however, it has the worst time efficiency. Our differential testing revealed 22 medical rules where implementation inconsistencies were discovered (e.g., regarding handling rule versions). Finally, we provide insights for practitioners and researchers based on the results.

ROSep 16, 2025
Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins

Erblin Isaku, Hassan Sartaj, Shaukat Ali et al.

Self-adaptive robots (SARs) in complex, uncertain environments must proactively detect and address abnormal behaviors, including out-of-distribution (OOD) cases. To this end, digital twins offer a valuable solution for OOD detection. Thus, we present a digital twin-based approach for OOD detection (ODiSAR) in SARs. ODiSAR uses a Transformer-based digital twin to forecast SAR states and employs reconstruction error and Monte Carlo dropout for uncertainty quantification. By combining reconstruction error with predictive variance, the digital twin effectively detects OOD behaviors, even in previously unseen conditions. The digital twin also includes an explainability layer that links potential OOD to specific SAR states, offering insights for self-adaptation. We evaluated ODiSAR by creating digital twins of two industrial robots: one navigating an office environment, and another performing maritime ship navigation. In both cases, ODiSAR forecasts SAR behaviors (i.e., robot trajectories and vessel motion) and proactively detects OOD events. Our results showed that ODiSAR achieved high detection performance -- up to 98\% AUROC, 96\% TNR@TPR95, and 95\% F1-score -- while providing interpretable insights to support self-adaptation.

SEJan 7, 2024
Efficient Test Data Generation for MC/DC with OCL and Search

Hassan Sartaj, Muhammad Zohaib Iqbal, Atif Aftab Ahmed Jilani et al.

System-level testing of avionics software systems requires compliance with different international safety standards such as DO-178C. An important consideration of the avionics industry is automated test data generation according to the criteria suggested by safety standards. One of the recommended criteria by DO-178C is the modified condition/decision coverage (MC/DC) criterion. The current model-based test data generation approaches use constraints written in Object Constraint Language (OCL), and apply search techniques to generate test data. These approaches either do not support MC/DC criterion or suffer from performance issues while generating test data for large-scale avionics systems. In this paper, we propose an effective way to automate MC/DC test data generation during model-based testing. We develop a strategy that utilizes case-based reasoning (CBR) and range reduction heuristics designed to solve MC/DC-tailored OCL constraints. We performed an empirical study to compare our proposed strategy for MC/DC test data generation using CBR, range reduction, both CBR and range reduction, with an original search algorithm, and random search. We also empirically compared our strategy with existing constraint-solving approaches. The results show that both CBR and range reduction for MC/DC test data generation outperform the baseline approach. Moreover, the combination of both CBR and range reduction for MC/DC test data generation is an effective approach compared to existing constraint solvers.

SEMay 26, 2025
Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap

Hassan Sartaj, Shaukat Ali, Paolo Arcaini et al.

Search-based software engineering (SBSE), which integrates metaheuristic search techniques with software engineering, has been an active area of research for about 25 years. It has been applied to solve numerous problems across the entire software engineering lifecycle and has demonstrated its versatility in multiple domains. With recent advances in AI, particularly the emergence of foundation models (FMs) such as large language models (LLMs), the evolution of SBSE alongside these models remains undetermined. In this window of opportunity, we present a research roadmap that articulates the current landscape of SBSE in relation to FMs, identifies open challenges, and outlines potential research directions to advance SBSE through its integration and interplay with FMs. Specifically, we analyze five core aspects: leveraging FMs for SBSE design, applying FMs to complement SBSE in SE problems, employing SBSE to address FM challenges, adapting SBSE practices for FMs tailored to SE activities, and exploring the synergistic potential between SBSE and FMs. Furthermore, we present a forward-thinking perspective that envisions the future of SBSE in the era of FMs, highlighting promising research opportunities to address challenges in emerging domains.

SEJan 22, 2020
CDST: A Toolkit for Testing Cockpit Display Systems of Avionics

Hassan Sartaj, Muhammad Zohaib Iqbal, Muhammad Uzair Khan

Avionics are highly critical systems that require extensive testing governed by international safety standards. Cockpit Display Systems (CDS) are an essential component of modern aircraft cockpits and display information from the user application using various widgets. A significant step in the testing of avionics is to evaluate whether these CDS are displaying the correct information. A common industrial practice is to manually test the information on these CDS by taking the aircraft into different scenarios during the simulation. Given the large number of scenarios to test, manual testing of such behavior is a laborious activity. In this paper, we present a CDST toolkit that automates the testing of Cockpit Display Systems (CDS). We discuss the workflow and architecture of the tool and also demonstrates the tool on an industrial case study. The results show that the tool is able to generate, execute, and evaluate the test cases and identify 3 bugs in the case study.