Lorenzo Strigini

h-index: 26

6 papers

86 citations

Novelty: 44%

AI Score: 29

Ranked #142,921 of 194,257 authors (top 74%); #1,628 in SE (top 54%)

6 Papers

8.0 SE May 4, 2025

On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous Vehicles

Xingyu Zhao, Robab Aghazadeh-Chakherlou, Chih-Hong Cheng et al.

Scenario-based testing has emerged as a common method for autonomous vehicle (AV) safety assessment, offering a more efficient alternative to mile-based testing by focusing on high-risk scenarios. However, fundamental questions persist regarding its stopping rules, residual risk estimation, debug effectiveness, and the impact of simulation fidelity on safety claims. This paper argues that a rigorous statistical foundation is essential to address these challenges and enable rigorous safety assurance. By drawing parallels between AV testing and established software testing methods, we identify shared research gaps and reusable solutions. We propose proof-of-concept models to quantify the probability of failure per scenario (pfs) and evaluate testing effectiveness under varying conditions. Our analysis reveals that neither scenario-based nor mile-based testing universally outperforms the other. Furthermore, we give an example of formal reasoning about alignment of synthetic and real-world testing outcomes, a first step towards supporting statistically defensible simulation-based safety claims.

4.5 AI Oct 20, 2021

Bootstrapping confidence in future safety based on past safe operation

Peter Bishop, Andrey Povyakalo, Lorenzo Strigini

With autonomous vehicles (AVs), a major concern is the inability to give meaningful quantitative assurance of safety, to the extent required by society - e.g. that an AV must be at least as safe as a good human driver - before that AV is in extensive use. We demonstrate an approach to achieving more moderate, but useful, confidence, e.g., confidence of low enough probability of causing accidents in the early phases of operation. This formalises mathematically the common approach of operating a system on a limited basis in the hope that mishap-free operation will confirm one's confidence in its safety and allow progressively more extensive operation: a process of "bootstrapping" of confidence. Translating that intuitive approach into theorems shows: (1) that it is substantially sound in the right circumstances, and could be a good method for deciding about the early deployment phase for an AV; (2) how much confidence can be rightly derived from such a "cautious deployment" approach, so that we can avoid over-optimism; (3) under which conditions our sound formulas for future confidence are applicable; (4) thus, which analyses of the concrete situations, and/or constraints on practice, are needed in order to enjoy the advantages of provably correct confidence in adequate future safety.

3.6 SE Oct 13, 2021

HEDP: A Method for Early Forecasting Software Defects based on Human Error Mechanisms

Fuqun Huang, Lorenzo Strigini

As the primary cause of software defects, human error is the key to understanding, and perhaps to predicting and avoiding them. Little research has been done to predict defects on the basis of the cognitive errors that cause them. This paper proposes an approach to predicting software defects through knowledge about the cognitive mechanisms of human errors. Our theory is that the main process behind a software defect is that an error-prone scenario triggers human error modes, which psychologists have observed to recur across diverse activities. Software defects can then be predicted by identifying such scenarios, guided by this knowledge of typical error modes. The proposed idea emphasizes predicting the exact location and form of a possible defect. We conducted two case studies to demonstrate and validate this approach, with 55 programmers in a programming competition and 5 analysts serving as the users of the approach. We found it impressive that the approach was able to predict, at the requirement phase, the exact locations and forms of 7 out of the 22 (31.8%) specific types of defects that were found in the code. The defects predicted tended to be common defects: their occurrences constituted 75.7% of the total number of defects in the 55 developed programs; each of them was introduced by at least two persons. The fraction of the defects introduced by a programmer that were predicted was on average (over all programmers) 75%. Furthermore, these predicted defects were highly persistent through the debugging process. If the prediction had been used to successfully prevent these defects, this could have saved 46.2% of the debugging iterations. This excellent capability of forecasting the exact locations and forms of possible defects at the early phases of software development recommends the approach for substantial benefits to defect prevention and early detection.

3.7 HC Jun 25, 2021

Improving Human Decisions by Adjusting the Alerting Thresholds for Computer Alerting Tools According to User and Task Characteristics

Marwa Gadala, Lorenzo Strigini, Peter Ayton

Objective: To investigate whether performance (number of correct decisions) of humans supported by a computer alerting tool can be improved by tailoring the tool's alerting threshold (sensitivity/specificity combination) according to user ability and task difficulty. Background: Many researchers implicitly assume that for each tool there exists a single ideal threshold. But research shows the effects of alerting tools on decision errors to vary depending on variables such as user ability and task difficulty. These findings motivated our investigation. Method: Forty-seven participants edited text passages, aided by a simulated spell-checker tool. We experimentally manipulated passage difficulty and tool alerting threshold, measured participants' editing and dictation ability, and counted participants' decision errors (false positives + false negatives). We investigated whether alerting threshold, user ability, task difficulty and their interactions affected error count. Results: Which alerting threshold better helped a user depended on an interaction between user ability and passage difficulty. Some effects were large: for higher ability users, a more sensitive threshold reduced errors by 30%, on the easier passages. Participants were not significantly more likely to prefer the alerting threshold with which they performed better. Conclusion: Adjusting alerting thresholds for individual users' ability and task difficulty could substantially improve effects of automated alerts on user decisions. This potential deserves further investigation. Improvement size and rules for adjustment will be application-specific. Application: Computer alerting tools have critical roles in many domains. Designers should assess potential benefits of adjustable alerting thresholds for their specific CAT application. Guidance for choosing thresholds will be essential for realizing these benefits in practice.

15.2 AI Aug 19, 2020

Assessing Safety-Critical Systems from Operational Testing: A Study on Autonomous Vehicles

Xingyu Zhao, Kizito Salako, Lorenzo Strigini et al.

Context: Demonstrating high reliability and safety for safety-critical systems (SCSs) remains a hard problem. Diverse evidence needs to be combined in a rigorous way: in particular, results of operational testing with other evidence from design and verification. Growing use of machine learning in SCSs, by precluding most established methods for gaining assurance, makes operational testing even more important for supporting safety and reliability claims. Objective: We use Autonomous Vehicles (AVs) as a current example to revisit the problem of demonstrating high reliability. AVs are making their debut on public roads: methods for assessing whether an AV is safe enough are urgently needed. We demonstrate how to answer 5 questions that would arise in assessing an AV type, starting with those proposed by a highly-cited study. Method: We apply new theorems extending Conservative Bayesian Inference (CBI), which exploit the rigour of Bayesian methods while reducing the risk of involuntary misuse associated with now-common applications of Bayesian inference; we define additional conditions needed for applying these methods to AVs. Results: Prior knowledge can bring substantial advantages if the AV design allows strong expectations of safety before road testing. We also show how naive attempts at conservative assessment may lead to over-optimism instead; why extrapolating the trend of disengagements is not suitable for safety claims; use of knowledge that an AV has moved to a less stressful environment. Conclusion: While some reliability targets will remain too high to be practically verifiable, CBI removes a major source of doubt: it allows use of prior knowledge without inducing dangerously optimistic biases. For certain ranges of required reliability and prior beliefs, CBI thus supports feasible, sound arguments. Useful conservative claims can be derived from limited prior knowledge.

4.0 SE Apr 28, 2014

Evaluating the Assessment of Software Fault-Freeness

John Rushby, Bev Littlewood, Lorenzo Strigini

We propose to validate experimentally a theory of software certification that proceeds from assessment of confidence in fault-freeness (due to standards) to conservative prediction of failure-free operation.