SEAug 20, 2024
LeCov: Multi-level Testing Criteria for Large Language ModelsXuan Xie, Jiayang Song, Yuheng Huang et al.
Large Language Models (LLMs) are widely used in many different domains, but because of their limited interpretability, there are questions about how trustworthy they are in various perspectives, e.g., truthfulness and toxicity. Recent research has started developing testing methods for LLMs, aiming to uncover untrustworthy issues, i.e., defects, before deployment. However, systematic and formalized testing criteria are lacking, which hinders a comprehensive assessment of the extent and adequacy of testing exploration. To mitigate this threat, we propose a set of multi-level testing criteria, LeCov, for LLMs. The criteria consider three crucial LLM internal components, i.e., the attention mechanism, feed-forward neurons, and uncertainty, and contain nine types of testing criteria in total. We apply the criteria in two scenarios: test prioritization and coverage-guided testing. The experiment evaluation, on three models and four datasets, demonstrates the usefulness and effectiveness of LeCov.
PLDec 5, 2019Code
Perfectly Parallel Fairness Certification of Neural NetworksCaterina Urban, Maria Christakis, Valentin Wüstholz et al.
Recently, there is growing concern that machine-learning models, which currently assist or even automate decision making, reproduce, and in the worst case reinforce, bias of the training data. The development of tools and techniques for certifying fairness of these models or describing their biased behavior is, therefore, critical. In this paper, we propose a perfectly parallel static analysis for certifying causal fairness of feed-forward neural networks used for classification of tabular data. When certification succeeds, our approach provides definite guarantees, otherwise, it describes and quantifies the biased behavior. We design the analysis to be sound, in practice also exact, and configurable in terms of scalability and precision, thereby enabling pay-as-you-go certification. We implement our approach in an open-source tool and demonstrate its effectiveness on models trained with popular datasets.
SEFeb 20, 2017Code
Refinement-based Specification and Security Analysis of Separation KernelsYongwang Zhao, David Sanan, Fuyuan Zhang et al.
Assurance of information-flow security by formal methods is mandated in security certification of separation kernels. As an industrial standard for improving safety, ARINC 653 has been complied with by mainstream separation kernels. Due to the new trend of integrating safe and secure functionalities into one separation kernel, security analysis of ARINC 653 as well as a formal specification with security proofs are thus significant for the development and certification of ARINC 653 compliant Separation Kernels (ARINC SKs). This paper presents a specification development and security analysis method for ARINC SKs based on refinement. We propose a generic security model and a stepwise refinement framework. Two levels of functional specification are developed by the refinement. A major part of separation kernel requirements in ARINC 653 are modeled, such as kernel initialization, two-level scheduling, partition and process management, and inter-partition communication. The formal specification and its security proofs are carried out in the Isabelle/HOL theorem prover. We have reviewed the source code of one industrial and two open-source ARINC SK implementations, i.e. VxWorks 653, XtratuM, and POK, in accordance with the formal specification. During the verification and code review, six security flaws, which can cause information leakage, are found in the ARINC 653 standard and the implementations.
SEOct 17, 2015Code
Reasoning About Information Flow Security of Separation Kernels with Channel-based CommunicationYongwang Zhao, David Sann, Fuyuan Zhang et al.
Assurance of information flow security by formal methods is mandated in security certification of separation kernels. As an industrial standard for separation kernels, ARINC 653 has been complied with by mainstream separation kernels. Security of functionalities defined in ARINC 653 is thus very important for the development and certification of separation kernels. This paper presents the first effort to formally specify and verify separation kernels with ARINC 653 channel-based communication. We provide a reusable formal specification and security proofs for separation kernels in Isabelle/HOL. During reasoning about information flow security, we find some security flaws in the ARINC 653 standard, which can cause information leakage, and fix them in our specification. We also validate the existence of the security flaws in two open-source ARINC 653 compliant separation kernels.
CVJul 8, 2025
AR2: Attention-Guided Repair for the Robustness of CNNs Against Common CorruptionsFuyuan Zhang, Qichen Wang, Jianjun Zhao
Deep neural networks suffer from significant performance degradation when exposed to common corruptions such as noise, blur, weather, and digital distortions, limiting their reliability in real-world applications. In this paper, we propose AR2 (Attention-Guided Repair for Robustness), a simple yet effective method to enhance the corruption robustness of pretrained CNNs. AR2 operates by explicitly aligning the class activation maps (CAMs) between clean and corrupted images, encouraging the model to maintain consistent attention even under input perturbations. Our approach follows an iterative repair strategy that alternates between CAM-guided refinement and standard fine-tuning, without requiring architectural changes. Extensive experiments show that AR2 consistently outperforms existing state-of-the-art methods in restoring robustness on standard corruption benchmarks (CIFAR-10-C, CIFAR-100-C and ImageNet-C), achieving a favorable balance between accuracy on clean data and corruption robustness. These results demonstrate that AR2 provides a robust and scalable solution for enhancing model reliability in real-world environments with diverse corruptions.
SEJan 20, 2022
DeepGalaxy: Testing Neural Network Verifiers via Two-Dimensional Input Space ExplorationXuan Xie, Fuyuan Zhang
Deep neural networks (DNNs) are widely developed and applied in many areas, and the quality assurance of DNNs is critical. Neural network verification (NNV) aims to provide formal guarantees to DNN models. Similar to traditional software, neural network verifiers could also contain bugs, which would have a critical and serious impact, especially in safety-critical areas. However, little work exists on validating neural network verifiers. In this work, we propose DeepGalaxy, an automated approach based on differential testing to tackle this problem. Specifically, we (1) propose a line of mutation rules, including model level mutation and specification level mutation, to effectively explore the two-dimensional input space of neural network verifiers; and (2) propose heuristic strategies to select test cases. We leveraged our implementation of DeepGalaxy to test three state-of-the-art neural network verifies, Marabou, Eran, and Neurify. The experimental results support the efficiency and effectiveness of DeepGalaxy. Moreover, five unique unknown bugs were discovered
SEApr 13, 2020
Detecting Critical Bugs in SMT Solvers Using Blackbox Mutational FuzzingMuhammad Numair Mansur, Maria Christakis, Valentin Wüstholz et al.
Formal methods use SMT solvers extensively for deciding formula satisfiability, for instance, in software verification, systematic test generation, and program synthesis. However, due to their complex implementations, solvers may contain critical bugs that lead to unsound results. Given the wide applicability of solvers in software reliability, relying on such unsound results may have detrimental consequences. In this paper, we present STORM, a novel blackbox mutational fuzzing technique for detecting critical bugs in SMT solvers. We run our fuzzer on seven mature solvers and find 29 previously unknown critical bugs. STORM is already being used in testing new features of popular solvers before deployment.
LGOct 14, 2019
DeepSearch: A Simple and Effective Blackbox Attack for Deep Neural NetworksFuyuan Zhang, Sankalan Pal Chowdhury, Maria Christakis
Although deep neural networks have been very successful in image-classification tasks, they are prone to adversarial attacks. To generate adversarial inputs, there has emerged a wide variety of techniques, such as black- and whitebox attacks for neural networks. In this paper, we present DeepSearch, a novel fuzzing-based, query-efficient, blackbox attack for image classifiers. Despite its simplicity, DeepSearch is shown to be more effective in finding adversarial inputs than state-of-the-art blackbox approaches. DeepSearch is additionally able to generate the most subtle adversarial inputs in comparison to these approaches.
SEOct 18, 2018
An Event-based Compositional Reasoning Approach for Concurrent Reactive SystemsYongwang Zhao, David Sanan, Fuyuan Zhang et al.
Reactive systems are composed of a well defined set of input events that the system reacts with by executing an associated handler to each event. In concurrent environments, event handlers can interact with the execution of other programs such as hardware interruptions in preemptive systems, or other instances of the reactive system in multicore architectures. State of the art rely-guarantee based verification frameworks only focus on imperative programs, being difficult to capture in the rely and guarantee relations interactions with possible infinite sequences of event handlers, and the input arguments to event handlers. In this paper, we propose the formalisation in Isabelle/HOL of an event-based rely-guarantee approach for concurrent reactive systems. We develop a Pi-Core language which incorporates a concurrent imperative and system specification language by `events', and we build a rely-guarantee proof system for Pi-Core and prove its soundness. Our approach can deal with multicore and interruptible concurrency. We use two case studies to show this: an interruptible controller for stepper motors and an ARINC 653 multicore kernel, and prove the functional correctness and preservation of invariants of them in Isabelle/HOL.
SEJun 20, 2018
Combinatorial Testing for Deep Learning SystemsLei Ma, Fuyuan Zhang, Minhui Xue et al.
Deep learning (DL) has achieved remarkable progress over the past decade and been widely applied to many safety-critical applications. However, the robustness of DL systems recently receives great concerns, such as adversarial examples against computer vision systems, which could potentially result in severe consequences. Adopting testing techniques could help to evaluate the robustness of a DL system and therefore detect vulnerabilities at an early stage. The main challenge of testing such systems is that its runtime state space is too large: if we view each neuron as a runtime state for DL, then a DL system often contains massive states, rendering testing each state almost impossible. For traditional software, combinatorial testing (CT) is an effective testing technique to reduce the testing space while obtaining relatively high defect detection abilities. In this paper, we perform an exploratory study of CT on DL systems. We adapt the concept in CT and propose a set of coverage criteria for DL systems, as well as a CT coverage guided test generation technique. Our evaluation demonstrates that CT provides a promising avenue for testing DL systems. We further pose several open questions and interesting directions for combinatorial testing of DL systems.
SEMay 14, 2018
DeepMutation: Mutation Testing of Deep Learning SystemsLei Ma, Fuyuan Zhang, Jiyuan Sun et al.
Deep learning (DL) defines a new data-driven programming paradigm where the internal system logic is largely shaped by the training data. The standard way of evaluating DL models is to examine their performance on a test dataset. The quality of the test dataset is of great importance to gain confidence of the trained models. Using an inadequate test dataset, DL models that have achieved high test accuracy may still lack generality and robustness. In traditional software testing, mutation testing is a well-established technique for quality evaluation of test suites, which analyzes to what extent a test suite detects the injected faults. However, due to the fundamental difference between traditional software and deep learning-based software, traditional mutation testing techniques cannot be directly applied to DL systems. In this paper, we propose a mutation testing framework specialized for DL systems to measure the quality of test data. To do this, by sharing the same spirit of mutation testing in traditional software, we first define a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs). Then we design a set of model-level mutation operators that directly inject faults into DL models without a training process. Eventually, the quality of test data could be evaluated from the analysis on to what extent the injected faults could be detected. The usefulness of the proposed mutation testing techniques is demonstrated on two public datasets, namely MNIST and CIFAR-10, with three DL models.
SEMar 20, 2018
DeepGauge: Multi-Granularity Testing Criteria for Deep Learning SystemsLei Ma, Felix Juefei-Xu, Fuyuan Zhang et al.
Deep learning (DL) defines a new data-driven programming paradigm that constructs the internal system logic of a crafted neuron network through a set of training data. We have seen wide adoption of DL in many safety-critical scenarios. However, a plethora of studies have shown that the state-of-the-art DL systems suffer from various vulnerabilities which can lead to severe consequences when applied to real-world applications. Currently, the testing adequacy of a DL system is usually measured by the accuracy of test data. Considering the limitation of accessible high quality test data, good accuracy performance on test data can hardly provide confidence to the testing adequacy and generality of DL systems. Unlike traditional software systems that have clear and controllable logic and functionality, the lack of interpretability in a DL system makes system analysis and defect detection difficult, which could potentially hinder its real-world deployment. In this paper, we propose DeepGauge, a set of multi-granularity testing criteria for DL systems, which aims at rendering a multi-faceted portrayal of the testbed. The in-depth evaluation of our proposed testing criteria is demonstrated on two well-known datasets, five DL systems, and with four state-of-the-art adversarial attack techniques against DL. The potential usefulness of DeepGauge sheds light on the construction of more generic and robust DL systems.
SEJan 6, 2017
High-Assurance Separation Kernels: A Survey on Formal MethodsYongwang Zhao, David Sanan, Fuyuan Zhang et al.
Separation kernels provide temporal/spatial separation and controlled information flow to their hosted applications. They are introduced to decouple the analysis of applications in partitions from the analysis of the kernel itself. More than 20 implementations of separation kernels have been developed and widely applied in critical domains, e.g., avionics/aerospace, military/defense, and medical devices. Formal methods are mandated by the security/safety certification of separation kernels and have been carried out since this concept emerged. However, this field lacks a survey to systematically study, compare, and analyze related work. On the other hand, high-assurance separation kernels by formal methods still face big challenges. In this paper, an analytical framework is first proposed to clarify the functionalities, implementations, properties and standards, and formal methods application of separation kernels. Based on the proposed analytical framework, a taxonomy is designed according to formal methods application, functionalities, and properties of separation kernels. Research works in the literature are then categorized and overviewed by the taxonomy. In accordance with the analytical framework, a comprehensive analysis and discussion of related work are presented. Finally, four challenges and their possible technical directions for future research are identified, e.g. specification bottleneck, multicore and concurrency, and automation of full formal verification.
FLNov 2, 2016
Compositional Reasoning for Shared-variable Concurrent ProgramsFuyuan Zhang, Yongwang Zhao, David Sanan et al.
Scalable and automatic formal verification for concurrent systems is always demanding. In this paper, we propose a verification framework to support automated compositional reasoning for concurrent programs with shared variables. Our framework models concurrent programs as succinct automata and supports the verification of multiple important properties. Safety verification and simulations of succinct automata are parallel compositional, and safety properties of succinct automata are preserved under refinements. We generate succinct automata from infinite state concurrent programs in an automated manner. Furthermore, we propose the first automated approach to checking rely-guarantee based simulations between infinite state concurrent programs. We have prototyped our algorithms and applied our tool to the verification of multiple refinements.