42.6 | SE | Mar 10 | Code
Declarative Scenario-based Testing with RoadLogic
Ezio Bartocci, Alessio Gambi, Felix Gigler et al.
Scenario-based testing is a key method for cost-effective and safe validation of autonomous vehicles (AVs). Existing approaches rely on imperative scenario definitions, requiring developers to manually enumerate numerous variants to achieve coverage. Declarative languages, such as OpenSCENARIO DSL (OS2), raise the abstraction level but lack systematic methods for instantiating concrete, specification-compliant scenarios as simulations. To our knowledge, no open-source solution currently provides this capability. We present RoadLogic, which bridges declarative OS2 specifications and executable simulations. It uses Answer Set Programming to generate abstract plans that satisfy the scenario constraints, motion planning to refine those plans into feasible trajectories, and specification-based monitoring to verify correctness. We evaluate RoadLogic by instantiating representative OS2 scenarios as simulations in the CommonRoad framework. Results show that RoadLogic consistently produces realistic, specification-satisfying simulations within minutes and captures diverse behavioral variants through parameter sampling, thus opening the door to systematic scenario-based testing for autonomous driving systems.
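The generate-then-monitor idea in this abstract can be illustrated with a toy sketch. It is not RoadLogic's implementation: brute-force enumeration stands in for Answer Set Programming, the motion-planning refinement step is omitted, and the two-lane road, the four time steps, and the names `satisfies_overtake`, `abstract_plans`, and `monitor` are all hypothetical.

```python
from itertools import product

LANES = (0, 1)   # hypothetical two-lane road: 0 = right lane, 1 = left lane
STEPS = 4        # hypothetical number of discrete time steps


def satisfies_overtake(plan):
    """Toy constraint: start and end in lane 0 and use lane 1 in between."""
    return plan[0] == 0 and plan[-1] == 0 and 1 in plan[1:-1]


def abstract_plans():
    """Enumerate every lane sequence and keep those meeting the constraint.

    A declarative solver would derive these plans from the specification;
    exhaustive enumeration is only a stand-in that works for tiny examples.
    """
    return [p for p in product(LANES, repeat=STEPS) if satisfies_overtake(p)]


def monitor(trace):
    """Re-check an executed lane trace against the same constraint."""
    return satisfies_overtake(tuple(trace))


if __name__ == "__main__":
    for plan in abstract_plans():
        print(plan, monitor(plan))
```

Each surviving plan is one behavioral variant of the same declarative constraint, which is why a single specification can yield several distinct simulations.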
RO | Jun 13, 2025 | Code
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
Yuan Gao, Mattia Piccinini, Yuchen Zhang et al.
For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis.
SE | Nov 8, 2021
Machine Learning-based Test Selection for Simulation-based Testing of Self-driving Cars Software
Sajad Khatiri, Christian Birchler, Bill Bosshard et al.
Simulation platforms facilitate the development of emerging cyber-physical systems (CPS) like self-driving cars (SDC) because they are more efficient and less dangerous than field operational tests. Despite this, thoroughly testing SDCs in simulated environments remains challenging because SDCs must be tested in a large number of long-running test scenarios. Past results on software testing optimization have shown that not all tests contribute equally to establishing confidence in the quality and reliability of test subjects, and that some "uninformative" tests can be skipped (or removed) to reduce testing effort. However, this problem has been only partially addressed in the context of SDC simulation platforms. In this paper, we investigate test selection strategies to increase the cost-effectiveness of simulation-based testing in the context of SDCs. We propose an approach called SDC-Scissor (SDC coSt-effeCtIve teSt SelectOR), which leverages machine learning (ML) strategies to identify and skip tests that are unlikely to detect faults in SDCs before executing them. Specifically, SDC-Scissor extracts features concerning the characteristics of the test scenarios to be executed in the simulation environment and uses ML strategies to predict which tests lead to faults before executing them. Our evaluation shows that SDC-Scissor achieved high classification accuracy (up to 93.4%) in classifying tests leading to a fault, which improves testing cost-effectiveness: SDC-Scissor reduced (ca. 170%) the time spent running irrelevant tests and identified 33% more failure-triggering tests than a randomized baseline. Interestingly, SDC-Scissor does not introduce significant computational overhead in the SDC testing process, which is critical to SDC development in industrial settings.
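A minimal sketch of feature-based test selection, assuming each test is a road given as a list of 2-D points. The two features (road length and total turning angle), the nearest-centroid classifier, and all function names are illustrative assumptions, not SDC-Scissor's actual feature set or models.

```python
import math


def road_features(points):
    """Return (total length, summed absolute turn angle in radians) of a polyline."""
    length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    turning = 0.0
    for a, b, c in zip(points, points[1:], points[2:]):
        h1 = math.atan2(b[1] - a[1], b[0] - a[0])
        h2 = math.atan2(c[1] - b[1], c[0] - b[0])
        # Wrap the heading change into [-pi, pi) before accumulating it.
        turning += abs((h2 - h1 + math.pi) % (2 * math.pi) - math.pi)
    return (length, turning)


class NearestCentroid:
    """Tiny stand-in classifier: predict the label whose mean feature vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda l: math.dist(x, self.centroids[l]))


def select_tests(model, roads):
    """Keep only the roads predicted to expose a fault, so the rest can be skipped."""
    return [r for r in roads if model.predict(road_features(r)) == "fail"]


if __name__ == "__main__":
    # Hypothetical labelled history: straight roads passed, winding roads failed.
    history = [
        ([(0, 0), (10, 0), (20, 0), (30, 0)], "pass"),
        ([(0, 0), (15, 0), (30, 0)], "pass"),
        ([(0, 0), (10, 0), (10, 10), (20, 10)], "fail"),
        ([(0, 0), (10, 0), (10, 10), (0, 10)], "fail"),
    ]
    model = NearestCentroid().fit(
        [road_features(r) for r, _ in history], [label for _, label in history]
    )
    candidates = [[(0, 0), (20, 0), (40, 0)], [(0, 0), (5, 0), (5, 5), (10, 5), (10, 10)]]
    print(select_tests(model, candidates))
```

The features are computed from the road description alone, so the prediction costs almost nothing compared with running the simulation. In practice the features would need scaling, since here road length dominates the distance.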
LG | Jul 5, 2021
DeepHyperion: Exploring the Feature Space of Deep Learning-Based Systems through Illumination Search
Tahereh Zohdinasab, Vincenzo Riccio, Alessio Gambi et al.
Deep Learning (DL) has been successfully applied to a wide range of application domains, including safety-critical ones. Several DL testing approaches have been recently proposed in the literature but none of them aims to assess how different interpretable features of the generated inputs affect the system's behaviour. In this paper, we resort to Illumination Search to find the highest-performing test cases (i.e., misbehaving and closest to misbehaving), spread across the cells of a map representing the feature space of the system. We introduce a methodology that guides the users of our approach in the tasks of identifying and quantifying the dimensions of the feature space for a given domain. We developed DeepHyperion, a search-based tool for DL systems that illuminates, i.e., explores at large, the feature space, by providing developers with an interpretable feature map where automatically generated inputs are placed along with information about the exposed behaviours.
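Illumination search keeps the best input found for each cell of a discretized feature map. The sketch below is a generic MAP-Elites-style loop on a toy problem, not DeepHyperion's code: the 2-D inputs, the 5x5 grid, the fitness function, and the mutation operator are hypothetical placeholders for a real DL system, its feature dimensions, and its closeness-to-misbehaviour measure.

```python
import random

GRID = 5  # hypothetical number of bins per feature dimension


def features(x):
    """Map an input in [0, 1]^2 to its cell in the feature map."""
    return tuple(min(int(v * GRID), GRID - 1) for v in x)


def fitness(x):
    """Toy score; higher stands for 'closer to misbehaving'."""
    return -abs(x[0] - x[1])


def mutate(x, rng):
    """Perturb each coordinate with Gaussian noise, clamped to [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in x)


def illuminate(seed_inputs, iterations, rng):
    """Return {cell: (fitness, input)} holding the best input seen per cell.

    seed_inputs must be non-empty so that a parent can always be selected.
    """
    archive = {}

    def try_insert(x):
        cell, score = features(x), fitness(x)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, x)

    for x in seed_inputs:
        try_insert(x)
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


if __name__ == "__main__":
    result = illuminate([(0.5, 0.5)], iterations=500, rng=random.Random(0))
    for cell in sorted(result):
        print(cell, round(result[cell][0], 3))
```

Because each cell is filled independently, the resulting map shows which combinations of feature values the search reached and how close the best input in each came to the target behaviour, rather than returning a single best input.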
SE | Jun 14, 2021
JUGE: An Infrastructure for Benchmarking Java Unit Test Generators
Xavier Devroey, Alessio Gambi, Juan Pablo Galeotti et al.
Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and for various platforms (e.g., desktop, web, or mobile applications). Such generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries vs. system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the one best suited to their requirements, while researchers seek to identify future research directions. This can be achieved through the systematic execution of large-scale evaluations of different generators. However, the execution of such empirical evaluations is not trivial and requires substantial effort to collect benchmarks, set up the evaluation infrastructure, and collect and analyse the results. In this paper, we present our JUnit Generation benchmarking infrastructure (JUGE), which supports generators (e.g., search-based, random-based, symbolic execution, etc.) that seek to automate the production of unit tests for various purposes (e.g., validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, eight editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place and have used and updated JUGE. As a result, an increasing number of tools (over ten) from both academia and industry have been evaluated on JUGE, have matured over the years, and have allowed the identification of future research directions.