Helge Spieker

AI
h-index12
31papers
480citations
Novelty42%
AI Score53

31 Papers

45.4SEJun 4
Metamorphic Testing with the Rashomon Set: Explanation Faithfulness in Machine Learning

Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb

Multiple machine learning models can achieve near-equivalent predictive performance on the same task, yet provide divergent feature-based explanations. This is called the Rashomon effect of (explainable) machine learning, and it raises the question of which explanations, if any, are trustworthy. We propose a framework based on metamorphic testing that assesses explanation faithfulness without requiring ground-truth labels by exploring attributed feature importance from post-hoc explanation methods. Five metamorphic relations formalize expected consistency properties between model behavior and feature attributions. We apply this general framework to two tabular regression datasets and two post-hoc explainers (SHAP and LIME) to demonstrate the approach. The framework offers a practical, model-agnostic tool for selecting accurate models with reliable and trustworthy explanations.

LGOct 24, 2023
Detecting Intentional AIS Shutdown in Open Sea Maritime Surveillance Using Self-Supervised Deep Learning

Pierre Bernabé, Arnaud Gotlieb, Bruno Legeard et al.

In maritime traffic surveillance, detecting illegal activities, such as illegal fishing or transshipment of illicit products is a crucial task of the coastal administration. In the open sea, one has to rely on Automatic Identification System (AIS) message transmitted by on-board transponders, which are captured by surveillance satellites. However, insincere vessels often intentionally shut down their AIS transponders to hide illegal activities. In the open sea, it is very challenging to differentiate intentional AIS shutdowns from missing reception due to protocol limitations, bad weather conditions or restricting satellite positions. This paper presents a novel approach for the detection of abnormal AIS missing reception based on self-supervised deep learning techniques and transformer models. Using historical data, the trained model predicts if a message should be received in the upcoming minute or not. Afterwards, the model reports on detected anomalies by comparing the prediction with what actually happens. Our method can process AIS messages in real-time, in particular, more than 500 Millions AIS messages per month, corresponding to the trajectories of more than 60 000 ships. The method is evaluated on 1-year of real-world data coming from four Norwegian surveillance satellites. Using related research results, we validated our method by rediscovering already detected intentional AIS shutdowns.

AIAug 24, 2023
Acquiring Qualitative Explainable Graphs for Automated Driving Scene Interpretation

Nassim Belmecheri, Arnaud Gotlieb, Nadjib Lazaar et al.

The future of automated driving (AD) is rooted in the development of robust, fair and explainable artificial intelligence methods. Upon request, automated vehicles must be able to explain their decisions to the driver and the car passengers, to the pedestrians and other vulnerable road users and potentially to external auditors in case of accidents. However, nowadays, most explainable methods still rely on quantitative analysis of the AD scene representations captured by multiple sensors. This paper proposes a novel representation of AD scenes, called Qualitative eXplainable Graph (QXG), dedicated to qualitative spatiotemporal reasoning of long-term scenes. The construction of this graph exploits the recent Qualitative Constraint Acquisition paradigm. Our experimental results on NuScenes, an open real-world multi-modal dataset, show that the qualitative eXplainable graph of an AD scene composed of 40 frames can be computed in real-time and light in space storage which makes it a potentially interesting tool for improved and more trustworthy perception and control processes in AD.

SEJul 26, 2024
Evaluating Human Trajectory Prediction with Metamorphic Testing

Helge Spieker, Nassim Belmecheri, Arnaud Gotlieb et al.

The prediction of human trajectories is important for planning in autonomous systems that act in the real world, e.g. automated driving or mobile robots. Human trajectory prediction is a noisy process, and no prediction does precisely match any future trajectory. It is therefore approached as a stochastic problem, where the goal is to minimise the error between the true and the predicted trajectory. In this work, we explore the application of metamorphic testing for human trajectory prediction. Metamorphic testing is designed to handle unclear or missing test oracles. It is well-designed for human trajectory prediction, where there is no clear criterion of correct or incorrect human behaviour. Metamorphic relations rely on transformations over source test cases and exploit invariants. A setting well-designed for human trajectory prediction where there are many symmetries of expected human behaviour under variations of the input, e.g. mirroring and rescaling of the input data. We discuss how metamorphic testing can be applied to stochastic human trajectory prediction and introduce the Wasserstein Violation Criterion to statistically assess whether a follow-up test case violates a label-preserving metamorphic relation.

LGSep 16, 2024
Safety-Oriented Pruning and Interpretation of Reinforcement Learning Policies

Dennis Gross, Helge Spieker

Pruning neural networks (NNs) can streamline them but risks removing vital parameters from safe reinforcement learning (RL) policies. We introduce an interpretable RL method called VERINTER, which combines NN pruning with model checking to ensure interpretable RL safety. VERINTER exactly quantifies the effects of pruning and the impact of neural connections on complex safety properties by analyzing changes in safety measurements. This method maintains safety in pruned RL policies and enhances understanding of their safety dynamics, which has proven effective in multiple RL settings.

LGSep 16, 2024
Enhancing RL Safety with Counterfactual LLM Reasoning

Dennis Gross, Helge Spieker

Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves and helps to explain the RL policy safety.

LGSep 16, 2024
Efficient Milling Quality Prediction with Explainable Machine Learning

Dennis Gross, Helge Spieker, Arnaud Gotlieb et al.

This paper presents an explainable machine learning (ML) approach for predicting surface roughness in milling. Utilizing a dataset from milling aluminum alloy 2017A, the study employs random forest regression models and feature importance techniques. The key contributions include developing ML models that accurately predict various roughness values and identifying redundant sensors, particularly those for measuring normal cutting force. Our experiments show that removing certain sensors can reduce costs without sacrificing predictive accuracy, highlighting the potential of explainable machine learning to improve cost-effectiveness in machining.

AIDec 19, 2025
Translating the Rashomon Effect to Sequential Decision-Making Tasks

Dennis Gross, Jørn Eirik Betten, Helge Spieker

The Rashomon effect describes the phenomenon where multiple models trained on the same data produce identical predictions while differing in which features they rely on internally. This effect has been studied extensively in classification tasks, but not in sequential decision-making, where an agent learns a policy to achieve an objective by taking actions in an environment. In this paper, we translate the Rashomon effect to sequential decision-making. We define it as multiple policies that exhibit identical behavior, visiting the same states and selecting the same actions, while differing in their internal structure, such as feature attributions. Verifying identical behavior in sequential decision-making differs from classification. In classification, predictions can be directly compared to ground-truth labels. In sequential decision-making with stochastic transitions, the same policy may succeed or fail on any single trajectory due to randomness. We address this using formal verification methods that construct and compare the complete probabilistic behavior of each policy in the environment. Our experiments demonstrate that the Rashomon effect exists in sequential decision-making. We further show that ensembles constructed from the Rashomon set exhibit greater robustness to distribution shifts than individual policies. Additionally, permissive policies derived from the Rashomon set reduce computational requirements for verification while maintaining optimal performance.

AIOct 8, 2025Code
Verifying Memoryless Sequential Decision-making of Large Language Models

Dennis Gross, Helge Spieker, Arnaud Gotlieb

We introduce a tool for rigorous and automated verification of large language model (LLM)- based policies in memoryless sequential decision-making tasks. Given a Markov decision process (MDP) representing the sequential decision-making task, an LLM policy, and a safety requirement expressed as a PCTL formula, our approach incrementally constructs only the reachable portion of the MDP guided by the LLM's chosen actions. Each state is encoded as a natural language prompt, the LLM's response is parsed into an action, and reachable successor states by the policy are expanded. The resulting formal model is checked with Storm to determine whether the policy satisfies the specified safety property. In experiments on standard grid world benchmarks, we show that open source LLMs accessed via Ollama can be verified when deterministically seeded, but generally underperform deep reinforcement learning baselines. Our tool natively integrates with Ollama and supports PRISM-specified tasks, enabling continuous benchmarking in user-specified sequential decision-making tasks and laying a practical foundation for formally verifying increasingly capable LLMs.

AINov 9, 2018Code
ITE: A Lightweight Implementation of Stratified Reasoning for Constructive Logical Operators

Arnaud Gotlieb, Dusica Marijan, Helge Spieker

Constraint Programming (CP) is a powerful declarative programming paradigm where inference and search are interleaved to find feasible and optimal solutions to various type of constraint systems. However, handling logical connectors with constructive information in CP is notoriously difficult. This paper presents If Then Else (ITE), a lightweight implementation of stratified constructive reasoning for logical connectives. Stratification is introduced to cope with the risk of combinatorial explosion of constructing information from nested and combined logical operators. ITE is an open-source library built on top of SICStus Prolog clp(fd), which proposes various operators, including constructive disjunction and negation, constructive implication and conditional. These operators can be used to express global constraints and to benefit from constructive reasoning for more domain pruning during constraint filtering. Even though ITE is not competitive with specialized filtering algorithms available in some global constraints implementations, its expressiveness allows users to easily define well-tuned constraints with powerful deduction capabilities. Our extended experimental results show that ITE is more efficient than available generic approaches that handle logical constraint systems over finite domains.

AIMar 27, 2024
Probabilistic Model Checking of Stochastic Reinforcement Learning Policies

Dennis Gross, Helge Spieker

We introduce a method to verify stochastic reinforcement learning (RL) policies. This approach is compatible with any RL algorithm as long as the algorithm and its corresponding environment collectively adhere to the Markov property. In this setting, the future state of the environment should depend solely on its current state and the action executed, independent of any previous states or actions. Our method integrates a verification technique, referred to as model checking, with RL, leveraging a Markov decision process, a trained RL policy, and a probabilistic computation tree logic (PCTL) formula to build a formal model that can be subsequently verified via the model checker Storm. We demonstrate our method's applicability across multiple benchmarks, comparing it to baseline methods called deterministic safety estimates and naive monolithic model checking. Our results show that our method is suited to verify stochastic RL policies.

SEJul 13, 2025
Prompting for Performance: Exploring LLMs for Configuring Software

Helge Spieker, Théo Matricon, Nassim Belmecheri et al.

Software systems usually provide numerous configuration options that can affect performance metrics such as execution time, memory usage, binary size, or bitrate. On the one hand, making informed decisions is challenging and requires domain expertise in options and their combinations. On the other hand, machine learning techniques can search vast configuration spaces, but with a high computational cost, since concrete executions of numerous configurations are required. In this exploratory study, we investigate whether large language models (LLMs) can assist in performance-oriented software configuration through prompts. We evaluate several LLMs on tasks including identifying relevant options, ranking configurations, and recommending performant configurations across various configurable systems, such as compilers, video encoders, and SAT solvers. Our preliminary results reveal both positive abilities and notable limitations: depending on the task and systems, LLMs can well align with expert knowledge, whereas hallucinations or superficial reasoning can emerge in other cases. These findings represent a first step toward systematic evaluations and the design of LLM-based solutions to assist with software configuration.

AIMar 27, 2024
Enhancing Manufacturing Quality Prediction Models through the Integration of Explainability Methods

Dennis Gross, Helge Spieker, Arnaud Gotlieb et al.

This research presents a method that utilizes explainability techniques to amplify the performance of machine learning (ML) models in forecasting the quality of milling processes, as demonstrated in this paper through a manufacturing use case. The methodology entails the initial training of ML models, followed by a fine-tuning phase where irrelevant features identified through explainability methods are eliminated. This procedural refinement results in performance enhancements, paving the way for potential reductions in manufacturing costs and a better understanding of the trained ML models. This study highlights the usefulness of explainability techniques in both explaining and optimizing predictive models in the manufacturing realm.

ROApr 17, 2025
Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks

Nassim Belmecheri, Arnaud Gotlieb, Nadjib Lazaar et al.

This paper investigates the integration of graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving. Scene understanding is the basis for any further reactive or proactive decision-making. Scene understanding and related reasoning is inherently an explanation task: why is another traffic participant doing something, what or who caused their actions? While previous work demonstrated QXGs' effectiveness using shallow machine learning models, these approaches were limited to analysing single relation chains between object pairs, disregarding the broader scene context. We propose a novel GNN architecture that processes entire graph structures to identify relevant objects in traffic scenes. We evaluate our method on the nuScenes dataset enriched with DriveLM's human-annotated relevance labels. Experimental results show that our GNN-based approach achieves superior performance compared to baseline methods. The model effectively handles the inherent class imbalance in relevant object identification tasks while considering the complete spatial-temporal relationships between all objects in the scene. Our work demonstrates the potential of combining qualitative representations with deep learning approaches for explainable scene understanding in autonomous driving systems.

AIJan 6, 2025
Co-Activation Graph Analysis of Safety-Verified and Explainable Deep Reinforcement Learning Policies

Dennis Gross, Helge Spieker

Deep reinforcement learning (RL) policies can demonstrate unsafe behaviors and are challenging to interpret. To address these challenges, we combine RL policy model checking--a technique for determining whether RL policies exhibit unsafe behaviors--with co-activation graph analysis--a method that maps neural network inner workings by analyzing neuron activation patterns--to gain insight into the safe RL policy's sequential decision-making. This combination lets us interpret the RL policy's inner workings for safe decision-making. We demonstrate its applicability in various experiments.

AIMar 25, 2024
Towards Trustworthy Automated Driving through Qualitative Scene Understanding and Explanations

Nassim Belmecheri, Arnaud Gotlieb, Nadjib Lazaar et al.

Understanding driving scenes and communicating automated vehicle decisions are key requirements for trustworthy automated driving. In this article, we introduce the Qualitative Explainable Graph (QXG), which is a unified symbolic and qualitative representation for scene understanding in urban mobility. The QXG enables interpreting an automated vehicle's environment using sensor data and machine learning models. It utilizes spatio-temporal graphs and qualitative constraints to extract scene semantics from raw sensor inputs, such as LiDAR and camera data, offering an interpretable scene model. A QXG can be incrementally constructed in real-time, making it a versatile tool for in-vehicle explanations across various sensor types. Our research showcases the potential of QXG, particularly in the context of automated driving, where it can rationalize decisions by linking the graph with observed actions. These explanations can serve diverse purposes, from informing passengers and alerting vulnerable road users to enabling post-hoc analysis of prior behaviors.

CVJan 29, 2024
Trustworthy Automated Driving through Qualitative Scene Understanding and Explanations

Nassim Belmecheri, Arnaud Gotlieb, Nadjib Lazaar et al.

We present the Qualitative Explainable Graph (QXG): a unified symbolic and qualitative representation for scene understanding in urban mobility. QXG enables the interpretation of an automated vehicle's environment using sensor data and machine learning models. It leverages spatio-temporal graphs and qualitative constraints to extract scene semantics from raw sensor inputs, such as LiDAR and camera data, offering an intelligible scene model. Crucially, QXG can be incrementally constructed in real-time, making it a versatile tool for in-vehicle explanations and real-time decision-making across various sensor types. Our research showcases the transformative potential of QXG, particularly in the context of automated driving, where it elucidates decision rationales by linking the graph with vehicle actions. These explanations serve diverse purposes, from informing passengers and alerting vulnerable road users (VRUs) to enabling post-analysis of prior behaviours.

LGFeb 1
Semi-supervised CAPP Transformer Learning via Pseudo-labeling

Dennis Gross, Helge Spieker, Arnaud Gotlieb et al.

High-level Computer-Aided Process Planning (CAPP) generates manufacturing process plans from part specifications. It suffers from limited dataset availability in industry, reducing model generalization. We propose a semi-supervised learning approach to improve transformer-based CAPP transformer models without manual labeling. An oracle, trained on available transformer behaviour data, filters correct predictions from unseen parts, which are then used for one-shot retraining. Experiments on small-scale datasets with simulated ground truth across the full data distribution show consistent accuracy gains over baselines, demonstrating the method's effectiveness in data-scarce manufacturing environments.

AISep 23, 2025
Bounded PCTL Model Checking of Large Language Model Outputs

Dennis Gross, Helge Spieker, Arnaud Gotlieb

In this paper, we introduce LLMCHECKER, a model-checking-based verification method to verify the probabilistic computation tree logic (PCTL) properties of an LLM text generation process. We empirically show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of $α$-$k$-bounded text generation, narrowing the focus to the $α$ maximal cumulative probability on the top-$k$ tokens at every step of the text generation process. Our verification method considers an initial string and the subsequent top-$k$ tokens while accommodating diverse text quantification methods, such as evaluating text quality and biases. The threshold $α$ further reduces the selected tokens, only choosing those that exceed or meet it in cumulative probability. LLMCHECKER then allows us to formally verify the PCTL properties of $α$-$k$-bounded LLMs. We demonstrate the applicability of our method in several LLMs, including Llama, Gemma, Mistral, Genstruct, and BERT. To our knowledge, this is the first time PCTL-based model checking has been used to check the consistency of the LLM text generation process.

LGSep 3, 2025
Rashomon in the Streets: Explanation Ambiguity in Scene Understanding

Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb et al.

Explainable AI (XAI) is essential for validating and trusting models in safety-critical applications like autonomous driving. However, the reliability of XAI is challenged by the Rashomon effect, where multiple, equally accurate models can offer divergent explanations for the same prediction. This paper provides the first empirical quantification of this effect for the task of action prediction in real-world driving scenes. Using Qualitative Explainable Graphs (QXGs) as a symbolic scene representation, we train Rashomon sets of two distinct model classes: interpretable, pair-based gradient boosting models and complex, graph-based Graph Neural Networks (GNNs). Using feature attribution methods, we measure the agreement of explanations both within and between these classes. Our results reveal significant explanation disagreement. Our findings suggest that explanation ambiguity is an inherent property of the problem, not just a modeling artifact.

SEFeb 24, 2022
Testing Deep Learning Models: A First Comparative Study of Multiple Testing Techniques

Mohit Kumar Ahuja, Arnaud Gotlieb, Helge Spieker

Deep Learning (DL) has revolutionized the capabilities of vision-based systems (VBS) in critical applications such as autonomous driving, robotic surgery, critical infrastructure surveillance, air and maritime traffic control, etc. By analyzing images, voice, videos, or any type of complex signals, DL has considerably increased the situation awareness of these systems. At the same time, while relying more and more on trained DL models, the reliability and robustness of VBS have been challenged and it has become crucial to test thoroughly these models to assess their capabilities and potential errors. To discover faults in DL models, existing software testing methods have been adapted and refined accordingly. In this article, we provide an overview of these software testing methods, namely differential, metamorphic, mutation, and combinatorial testing, as well as adversarial perturbation testing and review some challenges in their deployment for boosting perception systems used in VBS. We also provide a first experimental comparative study on a classical benchmark used in VBS and discuss its results.

AINov 4, 2021
Predictive Machine Learning of Objective Boundaries for Solving COPs

Helge Spieker, Arnaud Gotlieb

Solving Constraint Optimization Problems (COPs) can be dramatically simplified by boundary estimation, that is, providing tight boundaries of cost functions. By feeding a supervised Machine Learning (ML) model with data composed of known boundaries and extracted features of COPs, it is possible to train the model to estimate boundaries of a new COP instance. In this paper, we first give an overview of the existing body of knowledge on ML for Constraint Programming (CP) which learns from problem instances. Second, we introduce a boundary estimation framework that is applied as a tool to support a CP solver. Within this framework, different ML models are discussed and evaluated regarding their suitability for boundary estimation, and countermeasures to avoid unfeasible estimations that avoid the solver to find an optimal solution are shown. Third, we present an experimental study with distinct CP solvers on seven COPs. Our results show that near-optimal boundaries can be learned for these COPs with only little overhead. These estimated boundaries reduce the objective domain size by 60-88% and can help the solver to find near-optimal solutions early during search.

AIApr 24, 2021
Constraint-Guided Reinforcement Learning: Augmenting the Agent-Environment-Interaction

Helge Spieker

Reinforcement Learning (RL) agents have great successes in solving tasks with large observation and action spaces from limited feedback. Still, training the agents is data-intensive and there are no guarantees that the learned behavior is safe and does not violate rules of the environment, which has limitations for the practical deployment in real-world scenarios. This paper discusses the engineering of reliable agents via the integration of deep RL with constraint-based augmentation models to guide the RL agent towards safe behavior. Within the constraints set, the RL agent is free to adapt and explore, such that its effectiveness to solve the given problem is not hindered. However, once the RL agent leaves the space defined by the constraints, the outside models can provide guidance to still work reliably. We discuss integration points for constraint guidance within the RL process and perform experiments on two case studies: a strictly constrained card game and a grid world environment with additional combinatorial subgoals. Our results show that constraint-guidance does both provide reliability improvements and safer behavior, as well as accelerated training.

SENov 12, 2020
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Steffen Herbold, Alexander Trautsch, Benjamin Ledel et al.

Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

SEJul 14, 2020
Opening the Software Engineering Toolbox for the Assessment of Trustworthy AI

Mohit Kumar Ahuja, Mohamed-Bachir Belaid, Pierre Bernabé et al.

Trustworthiness is a central requirement for the acceptance and success of human-centered artificial intelligence (AI). To deem an AI system as trustworthy, it is crucial to assess its behaviour and characteristics against a gold standard of Trustworthy AI, consisting of guidelines, requirements, or only expectations. While AI systems are highly complex, their implementations are still based on software. The software engineering community has a long-established toolbox for the assessment of software systems, especially in the context of software testing. In this paper, we argue for the application of software engineering and testing practices for the assessment of trustworthy AI. We make the connection between the seven key requirements as defined by the European Commission's AI high-level expert group and established procedures from software engineering and raise questions for future work.

AIJun 20, 2020
Learning Objective Boundaries for Constraint Optimization Problems

Helge Spieker, Arnaud Gotlieb

Constraint Optimization Problems (COP) are often considered without sufficient knowledge on the boundaries of the objective variable to optimize. When available, tight boundaries are helpful to prune the search space or estimate problem characteristics. Finding close boundaries, that correctly under- and overestimate the optimum, is almost impossible without actually solving the COP. This paper introduces Bion, a novel approach for boundary estimation by learning from previously solved instances of the COP. Based on supervised machine learning, Bion is problem-specific and solver-independent and can be applied to any COP which is repeatedly solved with different data inputs. An experimental evaluation over seven realistic COPs shows that an estimation model can be trained to prune the objective variables' domains by over 80%. By evaluating the estimated boundaries with various COP solvers, we find that Bion improves the solving process for some problems, although the effect of closer bounds is generally problem-dependent.

SEOct 1, 2019
Adaptive Metamorphic Testing with Contextual Bandits

Helge Spieker, Arnaud Gotlieb

Metamorphic Testing is a software testing paradigm which aims at using necessary properties of a system-under-test, called metamorphic relations, to either check its expected outputs, or to generate new test cases. Metamorphic Testing has been successful to test programs for which a full oracle is not available or to test programs for which there are uncertainties on expected outputs such as learning systems. In this article, we propose Adaptive Metamorphic Testing as a generalization of a simple yet powerful reinforcement learning technique, namely contextual bandits, to select one of the multiple metamorphic relations available for a program. By using contextual bandits, Adaptive Metamorphic Testing learns which metamorphic relations are likely to transform a source test case, such that it has higher chance to discover faults. We present experimental results over two major case studies in machine learning, namely image classification and object detection, and identify weaknesses and robustness boundaries. Adaptive Metamorphic Testing efficiently identifies weaknesses of the tested systems in context of the source test case.

SEFeb 12, 2019
Time-aware Test Case Execution Scheduling for Cyber-Physical Systems

Morten Mossige, Arnaud Gotlieb, Helge Spieker et al.

Testing cyber-physical systems involves the execution of test cases on target-machines equipped with the latest release of a software control system. When testing industrial robots, it is common that the target machines need to share some common resources, e.g., costly hardware devices, and so there is a need to schedule test case execution on the target machines, accounting for these shared resources. With a large number of such tests executed on a regular basis, this scheduling becomes difficult to manage manually. In fact, with manual test execution planning and scheduling, some robots may remain unoccupied for long periods of time and some test cases may not be executed. This paper introduces TC-Sched, a time-aware method for automated test case execution scheduling. TC-Sched uses Constraint Programming to schedule tests to run on multiple machines constrained by the tests' access to shared resources, such as measurement or networking devices. The CP model is written in SICStus Prolog and uses the Cumulatives global constraint. Given a set of test cases, a set of machines, and a set of shared resources, TC-Sched produces an execution schedule where each test is executed once with minimal time between when a source code change is committed and the test results are reported to the developer. Experiments reveal that TC-Sched can schedule 500 test cases over 100 machines in less than 4 minutes for 99.5% of the instances. In addition, TC-Sched largely outperforms simpler methods based on a greedy algorithm and is suitable for deployment on industrial robot testing.

MLJan 14, 2019
Towards Testing of Deep Learning Systems with Training Set Reduction

Helge Spieker, Arnaud Gotlieb

Testing the implementation of deep learning systems and their training routines is crucial to maintain a reliable code base. Modern software development employs processes, such as Continuous Integration, in which changes to the software are frequently integrated and tested. However, testing the training routines requires running them and fully training a deep learning model can be resource-intensive, when using the full data set. Using only a subset of the training data can improve test run time, but can also reduce its effectiveness. We evaluate different ways for training set reduction and their ability to mimic the characteristics of model training with the original full data set. Our results underline the usefulness of training set reduction, especially in resource-constrained environments.

SENov 9, 2018
Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration

Helge Spieker, Arnaud Gotlieb, Dusica Marijan et al.

Testing in Continuous Integration (CI) involves test case prioritization, selection, and execution at each cycle. Selecting the most promising test cases to detect bugs is hard if there are uncertainties on the impact of committed code changes or, if traceability links between code and tests are not available. This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI with the goal to minimize the round-trip time between code commits and developer feedback on failed test cases. The Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, previous last execution and failure history. In a constantly changing environment, where new test cases are created and obsolete test cases are deleted, the Retecs method learns to prioritize error-prone test cases higher under guidance of a reward function and by observing previous CI cycles. By applying Retecs on data extracted from three industrial case studies, we show for the first time that reinforcement learning enables fruitful automatic adaptive test case selection and prioritization in CI and regression testing.

AINov 8, 2018
Multi-Cycle Assignment Problems with Rotational Diversity

Helge Spieker, Arnaud Gotlieb, Morten Mossige

Multi-cycle assignment problems address scenarios where a series of general assignment problems has to be solved sequentially. Subsequent cycles can differ from previous ones due to changing availability or creation of tasks and agents, which makes an upfront static schedule infeasible and introduces uncertainty in the task-agent assignment process. We consider the setting where, besides profit maximization, it is also desired to maintain diverse assignments for tasks and agents, such that all tasks have been assigned to all agents over subsequent cycles. This problem of multi-cycle assignment with rotational diversity is approached in two sub-problems: The outer problem which augments the original profit maximization objective with additional information about the state of rotational diversity while the inner problem solves the adjusted general assignment problem in a single execution of the model. We discuss strategies to augment the profit values and evaluate them experimentally. The method's efficacy is shown in three case studies: multi-cycle variants of the multiple knapsack and the multiple subset sum problems, and a real-world case study on the test case selection and assignment problem from the software engineering domain.