CVMar 3, 2023
Evaluation of Confidence-based Ensembling in Deep Learning Image ClassificationRafael Rosales, Peter Popov, Michael Paulitsch
Ensembling is a successful technique to improve the performance of machine learning (ML) models. Conf-Ensemble is an adaptation to Boosting to create ensembles based on model confidence instead of model errors to better classify difficult edge-cases. The key idea is to create successive model experts for samples that were difficult (not necessarily incorrectly classified) by the preceding model. This technique has been shown to provide better results than boosting in binary-classification with a small feature space (~80 features). In this paper, we evaluate the Conf-Ensemble approach in the much more complex task of image classification with the ImageNet dataset (224x224x3 features with 1000 classes). Image classification is an important benchmark for AI-based perception and thus it helps to assess if this method can be used in safety-critical applications using ML ensembles. Our experiments indicate that in a complex multi-label classification task, the expected benefit of specialization on complex input samples cannot be achieved with a small sample set, i.e., a good classifier seems to rely on very complex feature analysis that cannot be well trained on just a limited subset of "difficult samples". We propose an improvement to Conf-Ensemble to increase the number of samples fed to successive ensemble members, and a three-member Conf-Ensemble using this improvement was able to surpass a single model in accuracy, although the amount is not significant. Our findings shed light on the limits of the approach and the non-triviality of harnessing big data.
SENov 1, 2025
HIP-LLM: A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language ModelsRobab Aghazadeh-Chakherlou, Qing Guo, Siddartha Khastgir et al.
Large Language Models (LLMs) are increasingly deployed across diverse domains, raising the need for rigorous reliability assessment methods. Existing benchmark-based evaluations primarily offer descriptive statistics of model accuracy over datasets, providing limited insight into the probabilistic behavior of LLMs under real operational conditions. This paper introduces HIP-LLM, a Hierarchical Imprecise Probability framework for modeling and inferring LLM reliability. Building upon the foundations of software reliability engineering, HIP-LLM defines LLM reliability as the probability of failure-free operation over a specified number of future tasks under a given Operational Profile (OP). HIP-LLM represents dependencies across (sub-)domains hierarchically, enabling multi-level inference from subdomain to system-level reliability. HIP-LLM embeds imprecise priors to capture epistemic uncertainty and incorporates OPs to reflect usage contexts. It derives posterior reliability envelopes that quantify uncertainty across priors and data. Experiments on multiple benchmark datasets demonstrate that HIP-LLM offers a more accurate and standardized reliability characterization than existing benchmark and state-of-the-art approaches. A publicly accessible repository of HIP-LLM is provided.
SEMay 4, 2025
On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous VehiclesXingyu Zhao, Robab Aghazadeh-Chakherlou, Chih-Hong Cheng et al.
Scenario-based testing has emerged as a common method for autonomous vehicles (AVs) safety assessment, offering a more efficient alternative to mile-based testing by focusing on high-risk scenarios. However, fundamental questions persist regarding its stopping rules, residual risk estimation, debug effectiveness, and the impact of simulation fidelity on safety claims. This paper argues that a rigorous statistical foundation is essential to address these challenges and enable rigorous safety assurance. By drawing parallels between AV testing and established software testing methods, we identify shared research gaps and reusable solutions. We propose proof-of-concept models to quantify the probability of failure per scenario (\textit{pfs}) and evaluate testing effectiveness under varying conditions. Our analysis reveals that neither scenario-based nor mile-based testing universally outperforms the other. Furthermore, we give an example of formal reasoning about alignment of synthetic and real-world testing outcomes, a first step towards supporting statistically defensible simulation-based safety claims.
SEFeb 28, 2020
Towards Identifying and closing Gaps in Assurance of autonomous Road vehicleS -- a collection of Technical Notes Part 2Robin Bloomfield, Gareth Fletcher, Heidy Khlaaf et al.
This report provides an introduction and overview of the Technical Topic Notes (TTNs) produced in the Towards Identifying and closing Gaps in Assurance of autonomous Road vehicleS (Tigars) project. These notes aim to support the development and evaluation of autonomous vehicles. Part 1 addresses: Assurance-overview and issues, Resilience and Safety Requirements, Open Systems Perspective and Formal Verification and Static Analysis of ML Systems. This report is Part 2 and discusses: Simulation and Dynamic Testing, Defence in Depth and Diversity, Security-Informed Safety Analysis, Standards and Guidelines.
SEFeb 28, 2020
Towards Identifying and closing Gaps in Assurance of autonomous Road vehicleS -- a collection of Technical Notes Part 1Robin Bloomfield, Gareth Fletcher, Heidy Khlaaf et al.
This report provides an introduction and overview of the Technical Topic Notes (TTNs) produced in the Towards Identifying and closing Gaps in Assurance of autonomous Road vehicleS (Tigars) project. These notes aim to support the development and evaluation of autonomous vehicles. Part 1 addresses: Assurance-overview and issues, Resilience and Safety Requirements, Open Systems Perspective and Formal Verification and Static Analysis of ML Systems. Part 2: Simulation and Dynamic Testing, Defence in Depth and Diversity, Security-Informed Safety Analysis, Standards and Guidelines.
DCOct 7, 2015
Security-aware selection of Web Services for Reliable CompositionShahedeh Khani, Cristina Gacek, Peter Popov
Dependability is an important characteristic that a trustworthy computer system should have. It is a measure of Availability, Reliability, Maintainability, Safety and Security. The focus of our research is on security of web services. Web services enable the composition of independent services with complementary functionalities to produce value-added services, which allows organizations to implement their core business only and outsource other service components over the Internet, either pre-selected or on-the-fly. The selected third party web services may have security vulnerabilities. Vulnerable web services are of limited practical use. We propose to use an intrusion-tolerant composite web service for each functionality that should be fulfilled by a third party web service. The third party services employed in this approach should be selected based on their security vulnerabilities in addition to their performance. The security vulnerabilities of the third party services are assessed using a penetration testing tool. In this paper we present our preliminary research work.