SEApr 23, 2022
Industry-Academia Research Collaboration in Software Engineering: The Certus ModelDusica Marijan, Arnaud Gotlieb
Context: Research collaborations between software engineering industry and academia can provide significant benefits to both sides, including improved innovation capacity for industry, and real-world environment for motivating and validating research ideas. However, building scalable and effective research collaborations in software engineering is known to be challenging. While such challenges can be varied and many, in this paper we focus on the challenges of achieving participative knowledge creation supported by active dialog between industry and academia and continuous commitment to joint problem solving. Objective: This paper aims to understand what are the elements of a successful industry-academia collaboration that enable the culture of participative knowledge creation. Method: We conducted participant observation collecting qualitative data spanning 8 years of collaborative research between a software engineering research group on software V&V and the Norwegian IT sector. The collected data was analyzed and synthesized into a practical collaboration model, named the Certus Model. Results: The model is structured in seven phases, describing activities from setting up research projects to the exploitation of research results. As such, the Certus model advances other collaborations models from literature by delineating different phases covering the complete life cycle of participative research knowledge creation. Conclusion: The Certus model describes the elements of a research collaboration process between researchers and practitioners in software engineering, grounded on the principles of research knowledge co-creation and continuous commitment to joint problem solving. The model can be applied and tested in other contexts where it may be adapted to the local context through experimentation.
SEApr 30, 2022
Software Testing for Machine LearningDusica Marijan, Arnaud Gotlieb
Machine learning has become prevalent across a wide variety of applications. Unfortunately, machine learning has also shown to be susceptible to deception, leading to errors, and even fatal failures. This circumstance calls into question the widespread use of machine learning, especially in safety-critical applications, unless we are able to assure its correctness and trustworthiness properties. Software verification and testing are established technique for assuring such properties, for example by detecting errors. However, software testing challenges for machine learning are vast and profuse - yet critical to address. This summary talk discusses the current state-of-the-art of software testing for machine learning. More specifically, it discusses six key challenge areas for software testing of machine learning systems, examines current approaches to these challenges and highlights their limitations. The paper provides a research agenda with elaborated directions for making progress toward advancing the state-of-the-art on testing of machine learning.
LGOct 24, 2023
Detecting Intentional AIS Shutdown in Open Sea Maritime Surveillance Using Self-Supervised Deep LearningPierre Bernabé, Arnaud Gotlieb, Bruno Legeard et al.
In maritime traffic surveillance, detecting illegal activities, such as illegal fishing or transshipment of illicit products is a crucial task of the coastal administration. In the open sea, one has to rely on Automatic Identification System (AIS) message transmitted by on-board transponders, which are captured by surveillance satellites. However, insincere vessels often intentionally shut down their AIS transponders to hide illegal activities. In the open sea, it is very challenging to differentiate intentional AIS shutdowns from missing reception due to protocol limitations, bad weather conditions or restricting satellite positions. This paper presents a novel approach for the detection of abnormal AIS missing reception based on self-supervised deep learning techniques and transformer models. Using historical data, the trained model predicts if a message should be received in the upcoming minute or not. Afterwards, the model reports on detected anomalies by comparing the prediction with what actually happens. Our method can process AIS messages in real-time, in particular, more than 500 Millions AIS messages per month, corresponding to the trajectories of more than 60 000 ships. The method is evaluated on 1-year of real-world data coming from four Norwegian surveillance satellites. Using related research results, we validated our method by rediscovering already detected intentional AIS shutdowns.
SEApr 29, 2022
Industry-academia research collaboration and knowledge co-creation: Patterns and anti-patternsDusica Marijan, Sagar Sen
Increasing the impact of software engineering research in the software industry and the society at large has long been a concern of high priority for the software engineering community. The problem of two cultures, research conducted in a vacuum (disconnected from the real world), or misaligned time horizons are just some of the many complex challenges standing in the way of successful industry-academia collaborations. This paper reports on the experience of research collaboration and knowledge co-creation between industry and academia in software engineering as a way to bridge the research-practice collaboration gap. Our experience spans 14 years of collaboration between researchers in software engineering and the European and Norwegian software and IT industry. Using the participant observation and interview methods we have collected and afterwards analyzed an extensive record of qualitative data. Drawing upon the findings made and the experience gained, we provide a set of 14 patterns and 14 anti-patterns for industry-academia collaborations, aimed to support other researchers and practitioners in establishing and running research collaboration projects in software engineering.
SEApr 22, 2022
Comparative Study of Machine Learning Test Case Prioritization for Continuous Integration TestingDusica Marijan
There is a growing body of research indicating the potential of machine learning to tackle complex software testing challenges. One such challenge pertains to continuous integration testing, which is highly time-constrained, and generates a large amount of data coming from iterative code commits and test runs. In such a setting, we can use plentiful test data for training machine learning predictors to identify test cases able to speed up the detection of regression bugs introduced during code integration. However, different machine learning models can have different fault prediction performance depending on the context and the parameters of continuous integration testing, for example variable time budget available for continuous integration cycles, or the size of test execution history used for learning to prioritize failing test cases. Existing studies on test case prioritization rarely study both of these factors, which are essential for the continuous integration practice. In this study we perform a comprehensive comparison of the fault prediction performance of machine learning approaches that have shown the best performance on test case prioritization tasks in the literature. We evaluate the accuracy of the classifiers in predicting fault-detecting tests for different values of the continuous integration time budget and with different length of test history used for training the classifiers. In evaluation, we use real-world industrial datasets from a continuous integration practice. The results show that different machine learning models have different performance for different size of test history used for model training and for different time budget available for test case execution. Our results imply that machine learning approaches for test prioritization in continuous integration testing should be carefully configured to achieve optimal performance.
14.5CLApr 23
Cross-Domain Data Selection and Augmentation for Automatic Compliance DetectionFariz Ikhwantri, Dusica Marijan
Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.
LGApr 25, 2022
Discovering Gateway Ports in Maritime Using Temporal Graph Neural Network Port ClassificationDogan Altan, Mohammad Etemad, Dusica Marijan et al.
Vessel navigation is influenced by various factors, such as dynamic environmental factors that change over time or static features such as vessel type or depth of the ocean. These dynamic and static navigational factors impose limitations on vessels, such as long waiting times in regions outside the actual ports, and we call these waiting regions gateway ports. Identifying gateway ports and their associated features such as congestion and available utilities can enhance vessel navigation by planning on fuel optimization or saving time in cargo operation. In this paper, we propose a novel temporal graph neural network (TGNN) based port classification method to enable vessels to discover gateway ports efficiently, thus optimizing their operations. The proposed method processes vessel trajectory data to build dynamic graphs capturing spatio-temporal dependencies between a set of static and dynamic navigational features in the data, and it is evaluated in terms of port classification accuracy on a real-world data set collected from ten vessels operating in Halifax, NS, Canada. The experimental results indicate that our TGNN-based port classification method provides an f-score of 95% in classifying ports.
AIAug 28, 2023
ReMAV: Reward Modeling of Autonomous Vehicles for Finding Likely Failure EventsAizaz Sharif, Dusica Marijan
Autonomous vehicles are advanced driving systems that are well known to be vulnerable to various adversarial attacks, compromising vehicle safety and posing a risk to other road users. Rather than actively training complex adversaries by interacting with the environment, there is a need to first intelligently find and reduce the search space to only those states where autonomous vehicles are found to be less confident. In this paper, we propose a black-box testing framework ReMAV that uses offline trajectories first to analyze the existing behavior of autonomous vehicles and determine appropriate thresholds to find the probability of failure events. To this end, we introduce a three-step methodology which i) uses offline state action pairs of any autonomous vehicle under test, ii) builds an abstract behavior representation using our designed reward modeling technique to analyze states with uncertain driving decisions, and iii) uses a disturbance model for minimal perturbation attacks where the driving decisions are less confident. Our reward modeling technique helps in creating a behavior representation that allows us to highlight regions of likely uncertain behavior even when the standard autonomous vehicle performs well. We perform our experiments in a high-fidelity urban driving environment using three different driving scenarios containing single- and multi-agent interactions. Our experiment shows an increase in 35, 23, 48, and 50% in the occurrences of vehicle collision, road object collision, pedestrian collision, and offroad steering events, respectively by the autonomous vehicle under test, demonstrating a significant increase in failure events. We compare ReMAV with two baselines and show that ReMAV demonstrates significantly better effectiveness in generating failure events compared to the baselines in all evaluation metrics.
LGFeb 25
Physics-Informed Machine Learning for Vessel Shaft Power and Fuel Consumption Prediction: Interpretable KAN-based ApproachHamza Haruna Mohammed, Dusica Marijan, Arnbjørn Maressa
Accurate prediction of shaft rotational speed, shaft power, and fuel consumption is crucial for enhancing operational efficiency and sustainability in maritime transportation. Conventional physics-based models provide interpretability but struggle with real-world variability, while purely data-driven approaches achieve accuracy at the expense of physical plausibility. This paper introduces a Physics-Informed Kolmogorov-Arnold Network (PI-KAN), a hybrid method that integrates interpretable univariate feature transformations with a physics-informed loss function and a leakage-free chained prediction pipeline. Using operational and environmental data from five cargo vessels, PI-KAN consistently outperforms the traditional polynomial method and neural network baselines. The model achieves the lowest mean absolute error (MAE) and root mean squared error (RMSE), and the highest coefficient of determination (R^2) for shaft power and fuel consumption across all vessels, while maintaining physically consistent behavior. Interpretability analysis reveals rediscovery of domain-consistent dependencies, such as cubic-like speed-power relationships and cosine-like wave and wind effects. These results demonstrate that PI-KAN achieves both predictive accuracy and interpretability, offering a robust tool for vessel performance monitoring and decision support in operational settings.
LGFeb 25
Estimation and Optimization of Ship Fuel Consumption in Maritime: Review, Challenges and Future DirectionsDusica Marijan, Hamza Haruna Mohammed, Bakht Zaman
To reduce carbon emissions and minimize shipping costs, improving the fuel efficiency of ships is crucial. Various measures are taken to reduce the total fuel consumption of ships, including optimizing vessel parameters and selecting routes with the lowest fuel consumption. Different estimation methods are proposed for predicting fuel consumption, while various optimization methods are proposed to minimize fuel oil consumption. This paper provides a comprehensive review of methods for estimating and optimizing fuel oil consumption in maritime transport. Our novel contributions include categorizing fuel oil consumption \& estimation methods into physics-based, machine-learning, and hybrid models, exploring their strengths and limitations. Furthermore, we highlight the importance of data fusion techniques, which combine AIS, onboard sensors, and meteorological data to enhance accuracy. We make the first attempt to discuss the emerging role of Explainable AI in enhancing model transparency for decision-making. Uniquely, key challenges, including data quality, availability, and the need for real-time optimization, are identified, and future research directions are proposed to address these gaps, with a focus on hybrid models, real-time optimization, and the standardization of datasets.
LGDec 23, 2025
Physics-guided Neural Network-based Shaft Power Prediction for VesselsDogan Altan, Hamza Haruna Mohammed, Glenn Terje Lines et al.
Optimizing maritime operations, particularly fuel consumption for vessels, is crucial, considering its significant share in global trade. As fuel consumption is closely related to the shaft power of a vessel, predicting shaft power accurately is a crucial problem that requires careful consideration to minimize costs and emissions. Traditional approaches, which incorporate empirical formulas, often struggle to model dynamic conditions, such as sea conditions or fouling on vessels. In this paper, we present a hybrid, physics-guided neural network-based approach that utilizes empirical formulas within the network to combine the advantages of both neural networks and traditional techniques. We evaluate the presented method using data obtained from four similar-sized cargo vessels and compare the results with those of a baseline neural network and a traditional approach that employs empirical formulas. The experimental results demonstrate that the physics-guided neural network approach achieves lower mean absolute error, root mean square error, and mean absolute percentage error for all tested vessels compared to both the empirical formula-based method and the base neural network.
LGAug 21, 2023
Measuring the Effect of Causal Disentanglement on the Adversarial Robustness of Neural Network ModelsPreben M. Ness, Dusica Marijan, Sunanda Bose
Causal Neural Network models have shown high levels of robustness to adversarial attacks as well as an increased capacity for generalisation tasks such as few-shot learning and rare-context classification compared to traditional Neural Networks. This robustness is argued to stem from the disentanglement of causal and confounder input signals. However, no quantitative study has yet measured the level of disentanglement achieved by these types of causal models or assessed how this relates to their adversarial robustness. Existing causal disentanglement metrics are not applicable to deterministic models trained on real-world datasets. We, therefore, utilise metrics of content/style disentanglement from the field of Computer Vision to measure different aspects of the causal disentanglement for four state-of-the-art causal Neural Network models. By re-implementing these models with a common ResNet18 architecture we are able to fairly measure their adversarial robustness on three standard image classification benchmarking datasets under seven common white-box attacks. We find a strong association (r=0.820, p=0.001) between the degree to which models decorrelate causal and confounder signals and their adversarial robustness. Additionally, we find a moderate negative association between the pixel-level information content of the confounder signal and adversarial robustness (r=-0.597, p=0.040).
AINov 9, 2018Code
ITE: A Lightweight Implementation of Stratified Reasoning for Constructive Logical OperatorsArnaud Gotlieb, Dusica Marijan, Helge Spieker
Constraint Programming (CP) is a powerful declarative programming paradigm where inference and search are interleaved to find feasible and optimal solutions to various type of constraint systems. However, handling logical connectors with constructive information in CP is notoriously difficult. This paper presents If Then Else (ITE), a lightweight implementation of stratified constructive reasoning for logical connectives. Stratification is introduced to cope with the risk of combinatorial explosion of constructing information from nested and combined logical operators. ITE is an open-source library built on top of SICStus Prolog clp(fd), which proposes various operators, including constructive disjunction and negation, constructive implication and conditional. These operators can be used to express global constraints and to benefit from constructive reasoning for more domain pruning during constraint filtering. Even though ITE is not competitive with specialized filtering algorithms available in some global constraints implementations, its expressiveness allows users to easily define well-tuned constraints with powerful deduction capabilities. Our extended experimental results show that ITE is more efficient than available generic approaches that handle logical constraint systems over finite domains.
19.4SEApr 22
Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance AnalysisFariz Ikhwantri, Dusica Marijan
An assurance case is a structured argument document that justifies claims about a system's requirements or properties, which are supported by evidence. In regulated domains, these are crucial for meeting compliance and safety requirements to industry standards. We propose a graph diagnostic framework for analysing the structure and provenance of assurance cases. We focus on two main tasks: (1) link prediction, to learn and identify connections between argument elements, and (2) graph classification, to differentiate between assurance cases created by a state-of-the-art large language model and those created by humans, aiming to detect bias. We compiled a publicly available dataset of assurance cases, represented as graphs with nodes and edges, supporting both link prediction and provenance analysis. Experiments show that graph neural networks (GNNs) achieve strong link prediction performance (ROC-AUC 0.760) on real assurance cases and generalise well across domains and semi-supervised settings. For provenance detection, GNNs effectively distinguish human-authored from LLM-generated cases (F1 0.94). We observed that LLM-generated assurance cases have different hierarchical linking patterns compared to human-authored cases. Furthermore, existing GNN explanation methods show only moderate faithfulness, revealing a gap between predicted reasoning and the true argument structure.
LGOct 3, 2025
From high-frequency sensors to noon reports: Using transfer learning for shaft power prediction in maritimeAkriti Sharma, Dogan Altan, Dusica Marijan et al.
With the growth of global maritime transportation, energy optimization has become crucial for reducing costs and ensuring operational efficiency. Shaft power is the mechanical power transmitted from the engine to the shaft and directly impacts fuel consumption, making its accurate prediction a paramount step in optimizing vessel performance. Power consumption is highly correlated with ship parameters such as speed and shaft rotation per minute, as well as weather and sea conditions. Frequent access to this operational data can improve prediction accuracy. However, obtaining high-quality sensor data is often infeasible and costly, making alternative sources such as noon reports a viable option. In this paper, we propose a transfer learning-based approach for predicting vessels shaft power, where a model is initially trained on high-frequency data from a vessel and then fine-tuned with low-frequency daily noon reports from other vessels. We tested our approach on sister vessels (identical dimensions and configurations), a similar vessel (slightly larger with a different engine), and a different vessel (distinct dimensions and configurations). The experiments showed that the mean absolute percentage error decreased by 10.6 percent for sister vessels, 3.6 percent for a similar vessel, and 5.3 percent for a different vessel, compared to the model trained solely on noon report data.
CLJun 10, 2025
Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case StructureFariz Ikhwantri, Dusica Marijan
Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.
AIDec 22, 2021
Evaluating the Robustness of Deep Reinforcement Learning for Autonomous Policies in a Multi-agent Urban Driving EnvironmentAizaz Sharif, Dusica Marijan
Deep reinforcement learning is actively used for training autonomous car policies in a simulated driving environment. Due to the large availability of various reinforcement learning algorithms and the lack of their systematic comparison across different driving scenarios, we are unsure of which ones are more effective for training autonomous car software in single-agent as well as multi-agent driving environments. A benchmarking framework for the comparison of deep reinforcement learning in a vision-based autonomous driving will open up the possibilities for training better autonomous car driving policies. To address these challenges, we provide an open and reusable benchmarking framework for systematic evaluation and comparative analysis of deep reinforcement learning algorithms for autonomous driving in a single- and multi-agent environment. Using the framework, we perform a comparative study of discrete and continuous action space deep reinforcement learning algorithms. We also propose a comprehensive multi-objective reward function designed for the evaluation of deep reinforcement learning-based autonomous driving agents. We run the experiments in a vision-only high-fidelity urban driving simulated environments. The results indicate that only some of the deep reinforcement learning algorithms perform consistently better across single and multi-agent scenarios when trained in various multi-agent-only environment settings. For example, A3C- and TD3-based autonomous cars perform comparatively better in terms of more robust actions and minimal driving errors in both single and multi-agent scenarios. We conclude that different deep reinforcement learning algorithms exhibit different driving and testing performance in different scenarios, which underlines the need for their systematic comparative analysis. The benchmarking framework proposed in this paper facilitates such a comparison.
AIDec 22, 2021
Adversarial Deep Reinforcement Learning for Improving the Robustness of Multi-agent Autonomous Driving PoliciesAizaz Sharif, Dusica Marijan
Autonomous cars are well known for being vulnerable to adversarial attacks that can compromise the safety of the car and pose danger to other road users. To effectively defend against adversaries, it is required to not only test autonomous cars for finding driving errors but to improve the robustness of the cars to these errors. To this end, in this paper, we propose a two-step methodology for autonomous cars that consists of (i) finding failure states in autonomous cars by training the adversarial driving agent, and (ii) improving the robustness of autonomous cars by retraining them with effective adversarial inputs. Our methodology supports testing autonomous cars in a multi-agent environment, where we train and compare adversarial car policy on two custom reward functions to test the driving control decision of autonomous cars. We run experiments in a vision-based high-fidelity urban driving simulated environment. Our results show that adversarial testing can be used for finding erroneous autonomous driving behavior, followed by adversarial training for improving the robustness of deep reinforcement learning-based autonomous driving policies. We demonstrate that the autonomous cars retrained using the effective adversarial inputs noticeably increase the performance of their driving policies in terms of reduced collision and offroad steering errors.
SEOct 14, 2021
DeepOrder: Deep Learning for Test Case Prioritization in Continuous Integration TestingAizaz Sharif, Dusica Marijan, Marius Liaaen
Continuous integration testing is an important step in the modern software engineering life cycle. Test prioritization is a method that can improve the efficiency of continuous integration testing by selecting test cases that can detect faults in the early stage of each cycle. As continuous integration testing produces voluminous test execution data, test history is a commonly used artifact in test prioritization. However, existing test prioritization techniques for continuous integration either cannot handle large test history or are optimized for using a limited number of historical test cycles. We show that such a limitation can decrease fault detection effectiveness of prioritized test suites. This work introduces DeepOrder, a deep learning-based model that works on the basis of regression machine learning. DeepOrder ranks test cases based on the historical record of test executions from any number of previous test cycles. DeepOrder learns failed test cases based on multiple factors including the duration and execution status of test cases. We experimentally show that deep neural networks, as a simple regression model, can be efficiently used for test case prioritization in continuous integration testing. DeepOrder is evaluated with respect to time-effectiveness and fault detection effectiveness in comparison with an industry practice and the state of the art approaches. The results show that DeepOrder outperforms the industry practice and state-of-the-art test prioritization approaches in terms of these two metrics.
SEMar 18, 2021
Blockchain Testing: Challenges, Techniques, and Research DirectionsChhagan Lal, Dusica Marijan
Specific testing solutions targeting blockchain-based software are gaining huge attention as blockchain technologies are being increasingly incorporated into enterprise systems. As blockchain-based software enters production systems, it is paramount to follow proper engineering practices, ensure the required level of testing, and assess the readiness of the developed system. The existing research aims at addressing the testing-related issues and challenges of engineering blockchain-based software by providing suitable techniques and tools. However, like any emerging discipline, the best practices and tools for testing blockchain-based systems are not yet sufficiently developed. In this paper, we provide a comprehensive survey on the testing of Blockchain-based Applications (BC-Apps). First, we provide a discussion on identified challenges that are associated with BCApp testing. Second, we use a layered approach to discuss the state-of-the-art testing efforts in the area of BC technologies. In particular, we present an overview of the existing testing tools and techniques that provide testing solutions either for different components at various layers of the BC-App stack or across the whole stack. Third, we provide a set of future research directions based on the identified BC testing challenges and gaps in the literature review of existing testing solutions for BC-Apps. Moreover, we reflect on the specificity of BC-based software development procedure, which makes some of the existing tools or techniques inadequate, and call for the definition of standardised testing procedures and techniques for BC-Apps. The aim of our study is to highlight the importance of BC-based software testing and to pave the way for disciplined, testable, and verifiable BC software development.
SEJul 14, 2020
Opening the Software Engineering Toolbox for the Assessment of Trustworthy AIMohit Kumar Ahuja, Mohamed-Bachir Belaid, Pierre Bernabé et al.
Trustworthiness is a central requirement for the acceptance and success of human-centered artificial intelligence (AI). To deem an AI system as trustworthy, it is crucial to assess its behaviour and characteristics against a gold standard of Trustworthy AI, consisting of guidelines, requirements, or only expectations. While AI systems are highly complex, their implementations are still based on software. The software engineering community has a long-established toolbox for the assessment of software systems, especially in the context of software testing. In this paper, we argue for the application of software engineering and testing practices for the assessment of trustworthy AI. We make the connection between the seven key requirements as defined by the European Commission's AI high-level expert group and established procedures from software engineering and raise questions for future work.
SENov 9, 2018
Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous IntegrationHelge Spieker, Arnaud Gotlieb, Dusica Marijan et al.
Testing in Continuous Integration (CI) involves test case prioritization, selection, and execution at each cycle. Selecting the most promising test cases to detect bugs is hard if there are uncertainties on the impact of committed code changes or, if traceability links between code and tests are not available. This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI with the goal to minimize the round-trip time between code commits and developer feedback on failed test cases. The Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, previous last execution and failure history. In a constantly changing environment, where new test cases are created and obsolete test cases are deleted, the Retecs method learns to prioritize error-prone test cases higher under guidance of a reward function and by observing previous CI cycles. By applying Retecs on data extracted from three industrial case studies, we show for the first time that reinforcement learning enables fruitful automatic adaptive test case selection and prioritization in CI and regression testing.