Fabrizio Pastore

SE
h-index5
20papers
301citations
Novelty44%
AI Score48

20 Papers

SEApr 1, 2022
Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems

Hazem Fahmy, Fabrizio Pastore, Lionel Briand et al.

When Deep Neural Networks (DNNs) are used in safety-critical systems, engineers should determine the safety risks associated with failures (i.e., erroneous outputs) observed during testing. For DNNs processing images, engineers visually inspect all failure-inducing images to determine common characteristics among them. Such characteristics correspond to hazard-triggering events (e.g., low illumination) that are essential inputs for safety analysis. Though informative, such activity is expensive and error-prone. To support such safety analysis practices, we propose SEDE, a technique that generates readable descriptions for commonalities in failure-inducing, real-world images and improves the DNN through effective retraining. SEDE leverages the availability of simulators, which are commonly used for cyber-physical systems. It relies on genetic algorithms to drive simulators towards the generation of images that are similar to failure-inducing, real-world images in the test set; it then employs rule learning algorithms to derive expressions that capture commonalities in terms of simulator parameter values. The derived expressions are then used to generate additional images to retrain and improve the DNN. With DNNs performing in-car sensing tasks, SEDE successfully characterized hazard-triggering events leading to a DNN accuracy drop. Also, SEDE enabled retraining leading to significant improvements in DNN accuracy, up to 18 percentage points.

SEOct 15, 2022
HUDD: A tool to debug DNNs for safety analysis

Hazem Fahmy, Fabrizio Pastore, Lionel Briand

We present HUDD, a tool that supports safety analysis practices for systems enabled by Deep Neural Networks (DNNs) by automatically identifying the root causes for DNN errors and retraining the DNN. HUDD stands for Heatmap-based Unsupervised Debugging of DNNs, it automatically clusters error-inducing images whose results are due to common subsets of DNN neurons. The intent is for the generated clusters to group error-inducing images having common characteristics, that is, having a common root cause. HUDD identifies root causes by applying a clustering algorithm to matrices (i.e., heatmaps) capturing the relevance of every DNN neuron on the DNN outcome. Also, HUDD retrains DNNs with images that are automatically selected based on their relatedness to the identified image clusters. Our empirical evaluation with DNNs from the automotive domain have shown that HUDD automatically identifies all the distinct root causes of DNN errors, thus supporting safety analysis. Also, our retraining approach has shown to be more effective at improving DNN accuracy than existing approaches. A demo video of HUDD is available at https://youtu.be/drjVakP7jdU.

SEJan 31, 2023
Supporting Safety Analysis of Image-processing DNNs through Clustering-based Approaches

Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore et al.

The adoption of deep neural networks (DNNs) in safety-critical contexts is often prevented by the lack of effective means to explain their results, especially when they are erroneous. In our previous work, we proposed a white-box approach (HUDD) and a black-box approach (SAFE) to automatically characterize DNN failures. They both identify clusters of similar images from a potentially large set of images leading to DNN failures. However, the analysis pipelines for HUDD and SAFE were instantiated in specific ways according to common practices, deferring the analysis of other pipelines to future work. In this paper, we report on an empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. They combine transfer learning, autoencoders, heatmaps of neuron relevance, dimensionality reduction techniques, and different clustering algorithms. Our results show that the best pipeline combines transfer learning, DBSCAN, and UMAP. It leads to clusters almost exclusively capturing images of the same failure scenario, thus facilitating root cause analysis. Further, it generates distinct clusters for each root cause of failure, thus enabling engineers to detect all the unsafe scenarios. Interestingly, these results hold even for failure scenarios that are only observed in a small percentage of the failing images.

SEMay 8Code
Can LLMs Solve Science or Just Write Code? Evaluating Quantum Solver Generation

Luciano Baresi, Domenico Bianculli, Maryse Ernzer et al.

Large Language Models (LLMs) show strong capabilities in code generation, motivating their use in automated quantum solver development. However, in quantum computing, successful execution of generated code is not sufficient: correctness depends on numerically accurate results, which are sensitive to non-trivial mappings, hybrid quantum-classical workflows, and algorithm-specific approximations. This work introduces Q-SAGE, an iterative methodology to evaluate LLMs' capability in generating quantum solvers for scientific problems. The methodology adopts an iterative approach by executing the script generated by the LLM, comparing the result with the result of a classical solver, and refining the script until the two results match within a tolerance threshold. We empirically evaluated the methodology with five families of scientific problems of different complexities and five LLMs, both open source and proprietary. The results show that iterative refinement substantially improves success rates, but introduces a significant computational overhead. Moreover, as model capability increases, failure modes shift from execution errors to numerical inaccuracies, highlighting the current limitations of LLM-based quantum software.

SEDec 11, 2019Code
Metamorphic Security Testing for Web Systems

Phu X. Mai, Fabrizio Pastore, Arda Goknil et al.

Security testing verifies that the data and the resources of software systems are protected from attackers. Unfortunately, it suffers from the oracle problem, which refers to the challenge, given an input for a system, of distinguishing correct from incorrect behavior. In many situations where potential vulnerabilities are tested, a test oracle may not exist, or it might be impractical due to the many inputs for which specific oracles have to be defined. In this paper, we propose a metamorphic testing approach that alleviates the oracle problem in security testing. It enables engineers to specify metamorphic relations (MRs) that capture security properties of the system. Such MRs are then used to automate testing and detect vulnerabilities. We provide a catalog of 22 system-agnostic MRs to automate security testing in Web systems. Our approach targets 39% of the OWASP security testing activities not automated by state-of-the-art techniques. It automatically detected 10 out of 12 vulnerabilities affecting two widely used systems, one commercial and the other open source (Jenkins).

SEMay 5
Beyond Rules: LLM-Powered Linting for Quantum Programs

Pietro Cassieri, Giuseppe Scanniello, Seung Yeob Shin et al.

As quantum computing transitions from theoretical experimentation to its practical application, the reliability of quantum software has become a critical bottleneck. Traditional static analysis techniques for quantum programs, primarily rule-based linters, are increasingly inadequate; they struggle to keep pace with rapidly evolving APIs and fail to capture complex, context-dependent quantum programming problems. This results in high maintenance overhead and limited detection capabilities. In this paper, we introduce LintQ-LLM+CoT and LintQ-LLM+RAG, novel approaches that redefine the detection of quantum programming problems by employing Large Language Models (LLMs) specialized, respectively, via Chain-of-Thought (CoT) prompting and a Retrieval-Augmented Generation (RAG) system that grounds the model's reasoning in a curated knowledge base of verified quantum programming problems and best practices. We conducted a rigorous and manual comparative evaluation against the state-of-the-art rule-based tool, LintQ, using a corpus of 55 Qiskit programs. Our results show that LLM-based approaches, with and without RAG, outperform LintQ in terms of quantum programming problems detection correctness (precision) and completeness (recall). Overall, LLM-based approaches were more effective than LintQ (F1-score equal to 0.70 and 0.68 vs. 0.41). Furthermore, the RAG-enhanced variant demonstrated a slightly superior precision, effectively reducing false positives. Our findings suggest that LLMs provide a scalable and adaptive foundation for the next generation of linters in quantum software engineering.

SEMay 5
Randomized and Diverse Input State Generation for Quantum Program Testing

Maryse Ernzer, Seung Yeob Shin, Fabrizio Pastore et al.

With the accelerating development of quantum technologies and their growing computational potential, quantum systems are being adapted for simulations and other critical tasks across diverse domains, making the reliability of the corresponding quantum software an essential concern. Although recent efforts have started to incorporate quantum-specific properties such as magnitude, phase, and entanglement under the form of input-coverage criteria into software testing, the unique structure of the quantum state space demands for more comprehensive testing. In particular, the notion of complete state-space exploration has so far received little attention. To address this gap, we propose a framework for evaluating test circuit generators with respect to their coverage of the quantum state space. Our contribution is threefold: we develop a set of diversity scores that capture both local and global indicators of the extent to which the state space is explored; we propose a test circuit generator that produces test input states via a Brick-Circuit (BC) construction designed to approximate ideal random states using hardware-compatible gates; we compare the proposed construction with existing generators based on their ability to generate uniformly distributed random test input states. Our extended diversity scores quantify the local correlations and global spread of magnitude, phase and entanglement. Using these scores, we evaluate the expressibility, defined as the capability to span the quantum state space uniformly, and entangling capabilities of existing generators relative to the BC generator. Our results show that the hardware-compatible BC generator achieves higher expressibility and entanglement performance at shallower depths than existing circuit generators.

SEMar 20, 2025
GAN-enhanced Simulation-driven DNN Testing in Absence of Ground Truth

Mohammed Attaoui, Fabrizio Pastore

The generation of synthetic inputs via simulators driven by search algorithms is essential for cost-effective testing of Deep Neural Network (DNN) components for safety-critical systems. However, in many applications, simulators are unable to produce the ground-truth data needed for automated test oracles and to guide the search process. To tackle this issue, we propose an approach for the generation of inputs for computer vision DNNs that integrates a generative network to ensure simulator fidelity and employs heuristic-based search fitnesses that leverage transformation consistency, noise resistance, surprise adequacy, and uncertainty estimation. We compare the performance of our fitnesses with that of a traditional fitness function leveraging ground truth; further, we assess how the integration of a GAN not leveraging the ground truth impacts on test and retraining effectiveness. Our results suggest that leveraging transformation consistency is the best option to generate inputs for both DNN testing and retraining; it maximizes input diversity, spots the inputs leading to worse DNN performance, and leads to best DNN performance after retraining. Besides enabling simulator-based testing in the absence of ground truth, our findings pave the way for testing solutions that replace costly simulators with diffusion and large language models, which might be more affordable than simulators, but cannot generate ground-truth data.

SEJan 25, 2022
Data-driven Mutation Analysis for Cyber-Physical Systems

Enrico Viganò, Oscar Cornejo, Fabrizio Pastore et al.

Cyber-physical systems (CPSs) typically consist of a wide set of integrated, heterogeneous components; consequently, most of their critical failures relate to the interoperability of such components.Unfortunately, most CPS test automation techniques are preliminary and industry still heavily relies on manual testing. With potentially incomplete, manually-generated test suites, it is of paramount importance to assess their quality. Though mutation analysis has demonstrated to be an effective means to assess test suite quality in some specific contexts, we lack approaches for CPSs. Indeed, existing approaches do not target interoperability problems and cannot be executed in the presence of black-box or simulated components, a typical situation with CPSs. In this paper, we introduce data-driven mutation analysis, an approach that consists in assessing test suite quality by verifying if it detects interoperability faults simulated by mutating the data exchanged by software components. To this end, we describe a data-driven mutation analysis technique (DaMAT) that automatically alters the data exchanged through data buffers. Our technique is driven by fault models in tabular form where engineers specify how to mutate data items by selecting and configuring a set of mutation operators. We have evaluated DaMAT with CPSs in the space domain; specifically, the test suites for the software systems of a microsatellite and nanosatellites launched on orbit last year. Our results show that the approach effectively detects test suite shortcomings, is not affected by equivalent and redundant mutants, and entails acceptable costs.

SEJan 13, 2022
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering

Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore et al.

Deep neural networks (DNNs) have demonstrated superior performance over classical machine learning to support many features in safety-critical systems. Although DNNs are now widely used in such systems (e.g., self driving cars), there is limited progress regarding automated support for functional safety analysis in DNN-based systems. For example, the identification of root causes of errors, to enable both risk analysis and DNN retraining, remains an open problem. In this paper, we propose SAFE, a black-box approach to automatically characterize the root causes of DNN errors. SAFE relies on a transfer learning model pre-trained on ImageNet to extract the features from error-inducing images. It then applies a density-based clustering algorithm to detect arbitrary shaped clusters of images modeling plausible causes of error. Last, clusters are used to effectively retrain and improve the DNN. The black-box nature of SAFE is motivated by our objective not to require changes or even access to the DNN internals to facilitate adoption. Experimental results show the superior ability of SAFE in identifying different root causes of DNN errors based on case studies in the automotive domain. It also yields significant improvements in DNN accuracy after retraining, while saving significant execution time and memory when compared to alternatives.

SEJan 13, 2021
Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results in the Space Domain

Oscar Cornejo, Fabrizio Pastore, Lionel Briand

On-board embedded software developed for spaceflight systems (space software) must adhere to stringent software quality assurance procedures. For example, verification and validation activities are typically performed and assessed by third party organizations. To further minimize the risk of human mistakes, space agencies, such as the European Space Agency (ESA), are looking for automated solutions for the assessment of software testing activities, which play a crucial role in this context. Over the years, mutation analysis has shown to be a promising solution for the automated assessment of test suites; it consists of measuring the quality of a test suite in terms of the percentage of injected faults leading to a test failure. A number of optimization techniques, addressing scalability and accuracy problems, have been proposed to facilitate the industrial adoption of mutation analysis. However, to date, two major problems prevent space agencies from enforcing mutation analysis in space software development. In this paper, we enhance mutation analysis optimization techniques to enable their applicability to embedded software and propose a pipeline that successfully integrates them to address scalability and accuracy issues in this context, as described above. Further, we report on the largest study involving embedded software systems in the mutation analysis literature. Our research is part of a research project funded by ESA ESTEC involving private companies (GomSpace Luxembourg and LuxSpace) in the space sector. These industry partners provided the case studies reported in this paper; they include an on-board software system managing a microsatellite currently on-orbit, a set of libraries used in deployed cubesats, and a mathematical library certified by ESA.

SEDec 4, 2020
Automated, Cost-effective, and Update-driven App Testing

Chanh Duc Ngo, Fabrizio Pastore, Lionel Briand

Apps' pervasive role in our society led to the definition of test automation approaches to ensure their dependability. However, state-of-the-art approaches tend to generate large numbers of test inputs and are unlikely to achieve more than 50% method coverage. In this paper, we propose a strategy to achieve significantly higher coverage of the code affected by updates with a much smaller number of test inputs, thus alleviating the test oracle problem. More specifically, we present ATUA, a model-based approach that synthesizes App models with static analysis, integrates a dynamically-refined state abstraction function and combines complementary testing strategies, including (1) coverage of the model structure, (2) coverage of the App code, (3) random exploration, and (4) coverage of dependencies identified through information retrieval. Its model-based strategy enables ATUA to generate a small set of inputs that exercise only the code affected by the updates. In turn, this makes common test oracle solutions more cost-effective as they tend to involve human effort. A large empirical evaluation, conducted with 72 App versions belonging to nine popular Android Apps, has shown that ATUA is more effective and less effort intensive than state-of-the-art approaches when testing App updates.

SEFeb 3, 2020
Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning

Hazem Fahmy, Fabrizio Pastore, Mojtaba Bagherzadeh et al.

Deep neural networks (DNNs) are increasingly important in safety-critical systems, for example in their perception layer to analyze images. Unfortunately, there is a lack of methods to ensure the functional safety of DNN-based components. We observe three major challenges with existing practices regarding DNNs in safety-critical systems: (1) scenarios that are underrepresented in the test set may lead to serious safety violation risks, but may, however, remain unnoticed; (2) characterizing such high-risk scenarios is critical for safety analysis; (3) retraining DNNs to address these risks is poorly supported when causes of violations are difficult to determine. To address these problems in the context of DNNs analyzing images, we propose HUDD, an approach that automatically supports the identification of root causes for DNN errors. HUDD identifies root causes by applying a clustering algorithm to heatmaps capturing the relevance of every DNN neuron on the DNN outcome. Also, HUDD retrains DNNs with images that are automatically selected based on their relatedness to the identified image clusters. We evaluated HUDD with DNNs from the automotive domain. HUDD was able to identify all the distinct root causes of DNN errors, thus supporting safety analysis. Also, our retraining approach has shown to be more effective at improving DNN accuracy than existing approaches.

SEJul 19, 2019
Automatic Generation of Acceptance Test Cases from Use Case Specifications: an NLP-based Approach

Chunhui Wang, Fabrizio Pastore, Arda Goknil et al.

Acceptance testing is a validation activity performed to ensure the conformance of software systems with respect to their functional requirements. In safety critical systems, it plays a crucial role since it is enforced by software standards. Test engineers need to identify all the representative test execution scenarios from requirements, determine the runtime conditions that trigger these scenarios, and finally provide the input data that satisfy these conditions. Given that requirements specifications are typically large and often provided in natural language, the generation of acceptance test cases tends to be expensive and error-prone. In this paper, we present UMTG, an approach that supports the generation of executable, system-level, acceptance test cases from requirements specifications in natural language, with the goal of reducing the manual effort required to generate test cases and ensuring requirements coverage. More specifically, UMTG automates the generation of acceptance test cases based on use case specifications and a domain model for the system under test, which are commonly produced in many development environments. Unlike existing approaches, it does not impose strong restrictions on the expressiveness of use case specifications. We rely on recent advances in natural language processing to automatically identify test scenarios and to generate formal constraints that capture conditions triggering the execution of the scenarios, thus enabling the generation of test data. In two industrial case studies, UMTG automatically and correctly translated 95% of the use case specification steps into formal constraints required for test data generation; furthermore, it generated test cases that exercise not only all the test scenarios manually implemented by experts, but also some critical scenarios not previously considered.

SEMay 28, 2019
Automating System Test Case Classification and Prioritization for Use Case-Driven Testing in Product Lines

Ines Hajri, Arda Goknil, Fabrizio Pastore et al.

Product Line Engineering (PLE) is a crucial practice in many software development environments where software systems are complex and developed for multiple customers with varying needs. At the same time, many development processes are use case-driven and this strongly influences their requirements engineering and system testing practices. In this paper, we propose, apply, and assess an automated system test case classification and prioritization approach specifically targeting system testing in the context of use case-driven development of product families. Our approach provides: (i) automated support to classify, for a new product in a product family, relevant and valid system test cases associated with previous products, and (ii) automated prioritization of system test cases using multiple risk factors such as fault-proneness of requirements and requirements volatility in a product family. Our evaluation was performed in the context of an industrial product family in the automotive domain. Results provide empirical evidence that we propose a practical and beneficial way to classify and prioritize system test cases for industrial product lines.

SEAug 30, 2017
An Exploratory Study of Field Failures

Luca Gazzola, Leonardo Mariani, Fabrizio Pastore et al.

Field failures, that is, failures caused by faults that escape the testing phase leading to failures in the field, are unavoidable. Improving verification and validation activities before deployment can identify and timely remove many but not all faults, and users may still experience a number of annoying problems while using their software systems. This paper investigates the nature of field failures, to understand to what extent further improving in-house verification and validation activities can reduce the number of failures in the field, and frames the need of new approaches that operate in the field. We report the results of the analysis of the bug reports of five applications belonging to three different ecosystems, propose a taxonomy of field failures, and discuss the reasons why failures belonging to the identified classes cannot be detected at design time but shall be addressed at runtime. We observe that many faults (70%) are intrinsically hard to detect at design-time.

SEAug 7, 2017
VART: A Tool for the Automatic Detection of Regression Faults

Fabrizio Pastore, Leonardo Mariani

In this paper we present VART, a tool for automatically revealing regression faults missed by regression test suites. Interestingly, VART is not limited to faults causing crashing or exceptions, but can reveal faults that cause the violation of application-specific correctness properties. VART achieves this goal by combining static and dynamic program analysis.

SEAug 4, 2017
BDCI: Behavioral Driven Conflict Identification

Fabrizio Pastore, Leonardo Mariani, Daniela Micucci

Source Code Management (SCM) systems support software evolution by providing features, such as version control, branching, and conflict detection. Despite the presence of these features, support to parallel software development is often limited. SCM systems can only address a subset of the conflicts that might be introduced by developers when concurrently working on multiple parallel branches. In fact, SCM systems can detect textual conflicts, which are generated by the concurrent modification of the same program locations, but they are unable to detect higher-order conflicts, which are generated by the concurrent modification of different program locations that generate program misbehaviors once merged. Higher-order conflicts are painful to detect and expensive to fix because they might be originated by the interference of apparently unrelated changes. In this paper we present Behavioral Driven Conflict Identification (BDCI), a novel approach to conflict detection. BDCI moves the analysis of conflicts from the source code level to the level of program behavior by generating and comparing behavioral models. The analysis based on behavioral models can reveal interfering changes as soon as they are introduced in the SCM system, even if they do not introduce any textual conflict. To evaluate the effectiveness and the cost of the proposed approach, we developed BDCIf , a specific instance of BDCI dedicated to the detection of higher-order conflicts related to the functional behavior of a program. The evidence collected by analyzing multiple versions of Git and Redis suggests that BDCIf can effectively detect higher-order conflicts and report how changes might interfere.

SEMay 23, 2017
Timed k-Tail: Automatic Inference of Timed Automata

Fabrizio Pastore, Daniela Micucci, Leonardo Mariani

Accurate and up-to-date models describing the be- havior of software systems are seldom available in practice. To address this issue, software engineers may use specification mining techniques, which can automatically derive models that capture the behavior of the system under analysis. So far, most specification mining techniques focused on the functional behavior of the systems, with specific emphasis on models that represent the ordering of operations, such as tempo- ral rules and finite state models. Although useful, these models are inherently partial. For instance, they miss the timing behavior, which is extremely relevant for many classes of systems and com- ponents, such as shared libraries and user-driven applications. Mining specifications that include both the functional and the timing aspects can improve the applicability of many testing and analysis solutions. This paper addresses this challenge by presenting the Timed k-Tail (TkT) specification mining technique that can mine timed automata from program traces. Since timed automata can effectively represent the interplay between the functional and the timing behavior of a system, TkT could be exploited in those contexts where time-related information is relevant. Our empirical evaluation shows that TkT can efficiently and effectively mine accurate models. The mined models have been used to identify executions with anomalous timing. The evaluation shows that most of the anomalous executions have been correctly identified while producing few false positives.