Daniel Kroening

LG
h-index55
55papers
1,988citations
Novelty56%
AI Score60

55 Papers

LGSep 21, 2022Code
LCRL: Certified Policy Synthesis via Logically-Constrained Reinforcement Learning

Hosein Hasanbeig, Daniel Kroening, Alessandro Abate

LCRL is a software tool that implements model-free Reinforcement Learning (RL) algorithms over unknown Markov Decision Processes (MDPs), synthesising policies that satisfy a given linear temporal specification with maximal probability. LCRL leverages partially deterministic finite-state machines known as Limit Deterministic Buchi Automata (LDBA) to express a given linear temporal specification. A reward function for the RL algorithm is shaped on-the-fly, based on the structure of the LDBA. Theoretical guarantees under proper assumptions ensure the convergence of the RL algorithm to an optimal policy that maximises the satisfaction probability. We present case studies to demonstrate the applicability, ease of use, scalability, and performance of LCRL. Owing to the LDBA-guided exploration and LCRL model-free architecture, we observe robust performance, which also scales well when compared to standard RL approaches (whenever applicable to LTL specifications). Full instructions on how to execute all the case studies in this paper are provided on a GitHub page that accompanies the LCRL distribution www.github.com/grockious/lcrl.

100.0LOMay 1Code
Large Lemma Miners: Can LLMs do Induction Proofs for Hardware?

Romy Peled, Daniel Kroening, Michael Tautschnig et al.

Large Language Models (LLMs) have shown potential for solving mathematical tasks. We show that LLMs can be utilized to generate proofs by induction for hardware verification and thereby replace some of the manual work done by Formal Verification engineers and deliver industrial value. We present a neurosymbolic approach that includes two prompting frameworks to generate candidate invariants, which are checked using a formal, symbolic tool. Our results indicate that with sufficient reprompting, LLMs are able to generate inductive arguments for mid-size open-source RTL designs. For 84% of our problem set, at least one of the prompt setups succeeded in producing a provably correct inductive argument.

SYMay 6, 2017
Automated Formal Synthesis of Digital Controllers for State-Space Physical Plants

Alessandro Abate, Iury Bessa, Dario Cattaruzza et al.

We present a sound and automated approach to synthesize safe digital feedback controllers for physical plants represented as linear, time invariant models. Models are given as dynamical equations with inputs, evolving over a continuous state space and accounting for errors due to the digitalization of signals by the controller. Our approach has two stages, leveraging counterexample guided inductive synthesis (CEGIS) and reachability analysis. CEGIS synthesizes a static feedback controller that stabilizes the system under restrictions given by the safety of the reach space. Safety is verified either via BMC or abstract acceleration; if the verification step fails, we refine the controller by generalizing the counterexample. We synthesize stable and safe controllers for intricate physical plant models from the digital control literature.

SYFeb 16, 2017
Sound and Automated Synthesis of Digital Stabilizing Controllers for Continuous Plants

Alessandro Abate, Iury Bessa, Dario Cattaruzza et al.

Modern control is implemented with digital microcontrollers, embedded within a dynamical plant that represents physical components. We present a new algorithm based on counter-example guided inductive synthesis that automates the design of digital controllers that are correct by construction. The synthesis result is sound with respect to the complete range of approximations, including time discretization, quantization effects, and finite-precision arithmetic and its rounding errors. We have implemented our new algorithm in a tool called DSSynth, and are able to automatically generate stable controllers for a set of intricate plant models taken from the literature within minutes.

SYFeb 18, 2017
Verifying Digital Systems with MATLAB

Lennon Chaves, Iury Bessa, Lucas Cordeiro et al.

A MATLAB toolbox is presented, with the goal of checking occurrences of design errors typically found in fixed-point digital systems, considering finite word-length effects. In particular, the present toolbox works as a front-end to a recently introduced verification tool, known as Digital-System Verifier, and checks overflow, limit cycle, quantization, stability, and minimum phase errors, in digital systems represented by transfer-function and state-space equations. It provides a command-line version, with simplified access to specific functions, and a graphical-user interface, which was developed as a MATLAB application. The resulting toolbox is important for the verification community, since it shows the applicability of verification to real-world systems.

CVSep 25, 2023
Multiple Different Black Box Explanations for Image Classifiers

Hana Chockler, David A. Kelly, Daniel Kroening

Existing explanation tools for image classifiers usually give only a single explanation for an image's classification. For many images, however, image classifiers accept more than one explanation for the image label. These explanations are useful for analyzing the decision process of the classifier and for detecting errors. Thus, restricting the number of explanations to just one severely limits insight into the behavior of the classifier. In this paper, we describe an algorithm and a tool, MultEX, for computing multiple explanations as the output of a black-box image classifier for a given image. Our algorithm uses a principled approach based on actual causality. We analyze its theoretical complexity and evaluate MultEX against the state-of-the-art across three different models and three different datasets. We find that MultEX finds more explanations and that these explanations are of higher quality.

63.3SEMay 20
Agentic Model Checking

Youcheng Sun, Jiawen Liu, Daniel Kroening et al.

Verifying LLM-generated systems code is hard: bugs are prevalent, formal specifications are missing, and safety contracts are encoded implicitly at call sites rather than enforced at function boundaries. We propose agentic model checking, a paradigm that couples LLM agents with a bounded model checking backend under the principle agents propose, solvers verify: agents handle tasks requiring semantic judgment (spec inference, check selection, counterexample classification, refinement proposal) while BMC discharges every soundness-relevant decision. The paradigm rests on three commitments. Specifications are inferred top-down from caller context in a restricted DSL that translates deterministically into the backend's assume/assert primitives, with optional functional-correctness clauses lifting verification from panic-freeness to behavioural faithfulness. Verification is compositional: each function is checked in isolation against its spec with callees replaced by postcondition-constrained stubs, so per-query cost scales with a single function's state space and refinements propagate automatically to callers. Counterexamples are not bug reports: they pass through a validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) that distinguishes active in-tree crashes from latent public-API failures, while modelling artifacts drive a refinement loop rather than being suppressed. We instantiate the approach in BMC-Agent and evaluate it on LLM-generated kernel and compiler code in C and Rust alongside mature OSS-Fuzz-hardened libraries, confirming real defects, producing bounded clean verifications on heavily-fuzzed surfaces, and establishing functional equivalence on selected algorithmic functions.

40.1LGMay 1
EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems

Luan Pham, Victor Nicolet, Joey Dodds et al.

Anomaly detection and localization (ADL) is critical for maintaining reliability and availability in cloud systems. Recent ADL developments focus on metric and log data, leaving event data unexplored. To address this gap, we propose EventADL, the first open-box event-based ADL framework for cloud-based service systems. To motivate the design of our framework, we conduct a systematic analysis on 520 real-world incidents, and provide insights into how anomalies and their root causes manifest through event data. EventADL has three phases: offline training, online anomaly detection, and root cause localization. During the training phase, EventADL first learns Event Semantic Patterns (ESPs), which capture normal interactions between system entities using historical event data, and then learns Event Frequency Patterns (EFPs), which capture the normal frequency of known ESPs. In the online anomaly detection phase, any data in the event stream that deviates significantly from either pattern is identified as anomalous. For localization, EventADL constructs an Intervention Graph that models the relationships between recent system interactions and the detected anomalies for automatic root cause localization. The framework is designed to operate efficiently with unlabeled data and to produce interpretable anomalies with their corresponding root causes. Our evaluation on three real cloud service systems and two real-world incidents demonstrates that EventADL outperforms existing methods, achieving F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization.

CVNov 23, 2023
You Only Explain Once

David A. Kelly, Hana Chockler, Daniel Kroening et al.

In this paper, we propose a new black-box explainability algorithm and tool, YO-ReX, for efficient explanation of the outputs of object detectors. The new algorithm computes explanations for all objects detected in the image simultaneously. Hence, compared to the baseline, the new algorithm reduces the number of queries by a factor of 10X for the case of ten detected objects. The speedup increases further with with the number of objects. Our experimental results demonstrate that YO-ReX can explain the outputs of YOLO with a negligible overhead over the running time of YOLO. We also demonstrate similar results for explaining SSD and Faster R-CNN. The speedup is achieved by avoiding backtracking by combining aggressive pruning with a causal analysis.

LGMar 10, 2018Code
Testing Deep Neural Networks

Youcheng Sun, Xiaowei Huang, Daniel Kroening et al.

Deep neural networks (DNNs) have a wide range of applications, and software employing them must be thoroughly tested, especially in safety-critical domains. However, traditional software test coverage metrics cannot be applied directly to DNNs. In this paper, inspired by the MC/DC coverage criterion, we propose a family of four novel test criteria that are tailored to structural features of DNNs and their semantics. We validate the criteria by demonstrating that the generated test inputs guided via our proposed coverage criteria are able to capture undesired behaviours in a DNN. Test cases are generated using a symbolic approach and a gradient-based heuristic search. By comparing them with existing methods, we show that our criteria achieve a balance between their ability to find bugs (proxied using adversarial examples) and the computational cost of test case generation. Our experiments are conducted on state-of-the-art DNNs obtained using popular open source datasets, including MNIST, CIFAR-10 and ImageNet.

SESep 1, 2016Code
Equivalence Checking a Floating-point Unit against a High-level C Model (Extended Version)

Rajdeep Mukherjee, Saurabh Joshi, Andreas Griesmayer et al.

Semiconductor companies have increasingly adopted a methodology that starts with a system-level design specification in C/C++/SystemC. This model is extensively simulated to ensure correct functionality and performance. Later, a Register Transfer Level (RTL) implementation is created in Verilog, either manually by a designer or automatically by a high-level synthesis tool. It is essential to check that the C and Verilog programs are consistent. In this paper, we present a two-step approach, embodied in two equivalence checking tools, VERIFOX and HW-CBMC, to validate designs at the software and RTL levels, respectively. VERIFOX is used for equivalence checking of an untimed software model in C against a high-level reference model in C. HW-CBMC verifies the equivalence of a Verilog RTL implementation against an untimed software model in C. To evaluate our tools, we applied them to a commercial floating-point arithmetic unit (FPU) from ARM and an open-source dual-path floating-point adder.

LOOct 31, 2024
Neural Model Checking

Mirco Giacobbe, Daniel Kroening, Abhinandan Pal et al.

We introduce a machine learning approach to model checking temporal logic, with application to formal hardware verification. Model checking answers the question of whether every execution of a given system satisfies a desired temporal logic specification. Unlike testing, model checking provides formal guarantees. Its application is expected standard in silicon design and the EDA industry has invested decades into the development of performant symbolic model checking algorithms. Our new approach combines machine learning and symbolic reasoning by using neural networks as formal proof certificates for linear temporal logic. We train our neural certificates from randomly generated executions of the system and we then symbolically check their validity using satisfiability solving which, upon the affirmative answer, establishes that the system provably satisfies the specification. We leverage the expressive power of neural networks to represent proof certificates as well as the fact that checking a certificate is much simpler than finding one. As a result, our machine learning procedure for model checking is entirely unsupervised, formally sound, and practically effective. We experimentally demonstrate that our method outperforms the state-of-the-art academic and commercial model checkers on a set of standard hardware designs written in SystemVerilog.

AINov 13, 2024
Causal Explanations for Image Classifiers

Hana Chockler, David A. Kelly, Daniel Kroening et al.

Existing algorithms for explaining the output of image classifiers use different definitions of explanations and a variety of techniques to extract them. However, none of the existing tools use a principled approach based on formal definitions of causes and explanations for the explanation extraction. In this paper we present a novel black-box approach to computing explanations grounded in the theory of actual causality. We prove relevant theoretical results and present an algorithm for computing approximate explanations based on these definitions. We prove termination of our algorithm and discuss its complexity and the amount of approximation compared to the precise definition. We implemented the framework in a tool ReX and we present experimental results and a comparison with state-of-the-art tools. We demonstrate that \rex is the most efficient tool and produces the smallest explanations, in addition to outperforming other black-box tools on standard quality measures.

98.6SEApr 8
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

Ali Reza Ibrahimzada, Brandon Paulsen, Daniel Kroening et al.

Most repository-level code translation and validation techniques have been evaluated on a single source-target programming language (PL) pair, owing to the complex engineering effort required to adapt new PL pairs. Programming agents can enable PL-agnosticism in repository-level code translation and validation: they can synthesize code across many PLs and autonomously use existing tools specific to each PL's analysis. However, state-of-the-art has yet to offer a fully autonomous agentic approach for repository-level code translation and validation of large-scale programs. This paper proposes ReCodeAgent, an autonomous multi-agent approach for language-agnostic repository-level code translation and validation. Users only need to provide the project in the source PL and specify the target PL for ReCodeAgent to automatically translate and validate the entire repository. ReCodeAgent is the first technique to achieve high translation success rates across many PLs. We compare the effectiveness of ReCodeAgent with four alternative neuro-symbolic and agentic approaches to translate 118 real-world projects, with 1,975 LoC and 43 translation units for each project, on average. The projects cover 6 PLs (C, Go, Java, JavaScript, Python, and Rust) and 4 PL pairs (C-Rust, Go-Rust, Java-Python, Python-JavaScript). Our results demonstrate that ReCodeAgent consistently outperforms prior techniques on translation correctness, improving test pass rate by 60.8% on ground-truth tests, with an average cost of $15.3. We also perform process-centric analysis of ReCodeAgent trajectories to confirm its procedural efficiency. Finally, we investigate how the design choices (a multi-agent vs. single-agent architecture) influence ReCodeAgent performance: on average, the test pass rate drops by 40.4%, and trajectories become 28% longer and persistently inefficient.

LGDec 18, 2023
Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis

Rohan Mitta, Hosein Hasanbeig, Jun Wang et al.

This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL), such that the safety constraint violations are bounded at any point during learning. In a variety of RL applications the safety of the agent is particularly important, e.g. autonomous platforms or robots that work in proximity of humans. As enforcing safety during training might severely limit the agent's exploration, we propose here a new architecture that handles the trade-off between efficient progress and safety during exploration. As the exploration progresses, we update via Bayesian inference Dirichlet-Categorical models of the transition probabilities of the Markov decision process that describes the environment dynamics. This paper proposes a way to approximate moments of belief about the risk associated to the action selection policy. We construct those approximations, and prove the convergence results. We propose a novel method for leveraging the expectation approximations to derive an approximate bound on the confidence that the risk is below a certain level. This approach can be easily interleaved with RL and we present experimental results to showcase the performance of the overall architecture.

SEOct 23, 2025
Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents

Zhenning Yang, Hui Guan, Victor Nicolet et al.

Cloud infrastructure is managed through a mix of interfaces -- traditionally, cloud consoles, command-line interfaces (CLI), and SDKs are the tools of choice. Recently, Infrastructure-as-Code/IaC frameworks (e.g., Terraform) have quickly gained popularity. Unlike conventional tools, IaC~frameworks encode the infrastructure in a "source-of-truth" configuration. They are capable of automatically carrying out modifications to the cloud -- deploying, updating, or destroying resources -- to bring the actual infrastructure into alignment with the IaC configuration. However, when IaC is used alongside consoles, CLIs, or SDKs, it loses visibility into external changes, causing infrastructure drift, where the configuration becomes outdated, and later IaC operations may undo valid updates or trigger errors. We present NSync, an automated system for IaC reconciliation that propagates out-of-band changes back into the IaC program. Our key insight is that infrastructure changes eventually all occur via cloud API invocations -- the lowest layer for cloud management operations. NSync gleans insights from API traces to detect drift (i.e., non-IaC changes) and reconcile it (i.e., update the IaC configuration to capture the changes). It employs an agentic architecture that leverages LLMs to infer high-level intents from noisy API sequences, synthesize targeted IaC updates using specialized tools, and continually improve through a self-evolving knowledge base of past reconciliations. We further introduce a novel evaluation pipeline for injecting realistic drifts into cloud infrastructure and assessing reconciliation performance. Experiments across five real-world Terraform projects and 372 drift scenarios show that NSync outperforms the baseline both in terms of accuracy (from 0.71 to 0.97 pass@3) and token efficiency (1.47$\times$ improvement).

SESep 19, 2025
MatchFixAgent: Language-Agnostic Autonomous Repository-Level Code Translation Validation and Repair

Ali Reza Ibrahimzada, Brandon Paulsen, Reyhaneh Jabbarvand et al.

Code translation transforms source code from one programming language (PL) to another. Validating the functional equivalence of translation and repairing, if necessary, are critical steps in code translation. Existing automated validation and repair approaches struggle to generalize to many PLs due to high engineering overhead, and they rely on existing and often inadequate test suites, which results in false claims of equivalence and ineffective translation repair. We develop MatchFixAgent, a large language model (LLM)-based, PL-agnostic framework for equivalence validation and repair of translations. MatchFixAgent features a multi-agent architecture that divides equivalence validation into several sub-tasks to ensure thorough and consistent semantic analysis of the translation. Then it feeds this analysis to test agent to write and execute tests. Upon observing a test failure, the repair agent attempts to fix the translation bug. The final (in)equivalence decision is made by the verdict agent, considering semantic analyses and test execution results. We compare MatchFixAgent's validation and repair results with four repository-level code translation techniques. We use 2,219 translation pairs from their artifacts, which cover 6 PL pairs, and are collected from 24 GitHub projects totaling over 900K lines of code. Our results demonstrate that MatchFixAgent produces (in)equivalence verdicts for 99.2% of translation pairs, with the same equivalence validation result as prior work on 72.8% of them. When MatchFixAgent's result disagrees with prior work, we find that 60.7% of the time MatchFixAgent's result is actually correct. In addition, we show that MatchFixAgent can repair 50.6% of inequivalent translation, compared to prior work's 18.5%. This demonstrates that MatchFixAgent is far more adaptable to many PL pairs than prior work, while producing highly accurate validation results.

SESep 8, 2025
Hypergraph-Guided Regex Filter Synthesis for Event-Based Anomaly Detection

Margarida Ferreira, Victor Nicolet, Luan Pham et al. · cmu

We propose HyGLAD, a novel algorithm that automatically builds a set of interpretable patterns that model event data. These patterns can then be used to detect event-based anomalies in a stationary system, where any deviation from past behavior may indicate malicious activity. The algorithm infers equivalence classes of entities with similar behavior observed from the events, and then builds regular expressions that capture the values of those entities. As opposed to deep-learning approaches, the regular expressions are directly interpretable, which also translates to interpretable anomalies. We evaluate HyGLAD against all 7 unsupervised anomaly detection methods from DeepOD on five datasets from real-world systems. The experimental results show that on average HyGLAD outperforms existing deep-learning methods while being an order of magnitude more efficient in training and inference (single CPU vs GPU). Precision improved by 1.2x and recall by 1.3x compared to the second-best baseline.

LGJun 1, 2021
Exposing Previously Undetectable Faults in Deep Neural Networks

Isaac Dunn, Hadrien Pouget, Daniel Kroening et al.

Existing methods for testing DNNs solve the oracle problem by constraining the raw features (e.g. image pixel values) to be within a small distance of a dataset example for which the desired DNN output is known. But this limits the kinds of faults these approaches are able to detect. In this paper, we introduce a novel DNN testing method that is able to find faults in DNNs that other methods cannot. The crux is that, by leveraging generative machine learning, we can generate fresh test inputs that vary in their high-level features (for images, these include object shape, location, texture, and colour). We demonstrate that our approach is capable of detecting deliberately injected faults as well as new faults in state-of-the-art DNNs, and that in both cases, existing methods are unable to find these faults.

LGMar 5, 2021
Explanations for Occluded Images

Hana Chockler, Daniel Kroening, Youcheng Sun

Existing algorithms for explaining the output of image classifiers perform poorly on inputs where the object of interest is partially occluded. We present a novel, black-box algorithm for computing explanations that uses a principled approach based on causal theory. We have implemented the method in the DEEPCOVER tool. We obtain explanations that are much more accurate than those generated by the existing explanation tools on images with occlusions and observe a level of performance comparable to the state of the art when explaining images without occlusions.

LGFeb 7, 2021
Neural Termination Analysis

Mirco Giacobbe, Daniel Kroening, Julian Parsert

We introduce a novel approach to the automated termination analysis of computer programs: we use neural networks to represent ranking functions. Ranking functions map program states to values that are bounded from below and decrease as a program runs; the existence of a ranking function proves that the program terminates. We train a neural network from sampled execution traces of a program so that the network's output decreases along the traces; then, we use symbolic reasoning to formally verify that it generalises to all possible executions. Upon the affirmative answer we obtain a formal certificate of termination for the program, which we call a neural ranking function. We demonstrate that thanks to the ability of neural networks to represent nonlinear functions our method succeeds over programs that are beyond the reach of state-of-the-art tools. This includes programs that use disjunctions in their loop conditions and programs that include nonlinear expressions.

AIJan 20, 2021
Shielding Atari Games with Bounded Prescience

Mirco Giacobbe, Mohammadhosein Hasanbeig, Daniel Kroening et al.

Deep reinforcement learning (DRL) is applied in safety-critical domains such as robotics and autonomous driving. It achieves superhuman abilities in many tasks, however whether DRL agents can be shown to act safely is an open problem. Atari games are a simple yet challenging exemplar for evaluating the safety of DRL agents and feature a diverse portfolio of game mechanics. The safety of neural agents has been studied before using methods that either require a model of the system dynamics or an abstraction; unfortunately, these are unsuitable to Atari games because their low-level dynamics are complex and hidden inside their emulator. We present the first exact method for analysing and ensuring the safety of DRL agents for Atari games. Our method only requires access to the emulator. First, we give a set of 43 properties that characterise "safe behaviour" for 30 games. Second, we develop a method for exploring all traces induced by an agent and a game and consider a variety of sources of game non-determinism. We observe that the best available DRL agents reliably satisfy only very few properties; several critical properties are violated by all agents. Finally, we propose a countermeasure that combines a bounded explicit-state exploration with shielding. We demonstrate that our method improves the safety of all agents over multiple properties.

LGAug 31, 2020
Ranking Policy Decisions

Hadrien Pouget, Hana Chockler, Youcheng Sun et al.

Policies trained via Reinforcement Learning (RL) are often needlessly complex, making them difficult to analyse and interpret. In a run with $n$ time steps, a policy will make $n$ decisions on actions to take; we conjecture that only a small subset of these decisions delivers value over selecting a simple default action. Given a trained policy, we propose a novel black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of decisions made in those states. We argue that among other things, the ranked list of states can help explain and understand the policy. As the ranking method is statistical, a direct evaluation of its quality is hard. As a proxy for quality, we use the ranking to create new, simpler policies from the original ones by pruning decisions identified as unimportant (that is, replacing them by default actions) and measuring the impact on performance. Our experiments on a diverse set of standard benchmarks demonstrate that pruned policies can perform on a level comparable to the original policies. Conversely, we show that naive approaches for ranking policy decisions, e.g., ranking based on the frequency of visiting a state, do not result in high-performing pruned policies.

CRJul 12, 2020
The Taint Rabbit: Optimizing Generic Taint Analysis with Dynamic Fast Path Generation

John Galea, Daniel Kroening

Generic taint analysis is a pivotal technique in software security. However, it suffers from staggeringly high overhead. In this paper, we explore the hypothesis whether just-in-time (JIT) generation of fast paths for tracking taint can enhance the performance. To this end, we present the Taint Rabbit, which supports highly customizable user-defined taint policies and combines a JIT with fast context switching. Our experimental results suggest that this combination outperforms notable existing implementations of generic taint analysis and bridges the performance gap to specialized trackers. For instance, Dytan incurs an average overhead of 237x, while the Taint Rabbit achieves 1.7x on the same set of benchmarks. This compares favorably to the 1.5x overhead delivered by the bitwise, non-generic, taint engine LibDFT.

LGFeb 26, 2020
Cautious Reinforcement Learning with Logical Constraints

Mohammadhosein Hasanbeig, Alessandro Abate, Daniel Kroening

This paper presents the concept of an adaptive safe padding that forces Reinforcement Learning (RL) to synthesise optimal control policies while ensuring safety during the learning process. Policies are synthesised to satisfy a goal, expressed as a temporal logic formula, with maximal probability. Enforcing the RL agent to stay safe during learning might limit the exploration, however we show that the proposed architecture is able to automatically handle the trade-off between efficient progress in exploration (towards goal satisfaction) and ensuring safety. Theoretical guarantees are available on the optimality of the synthesised policies and on the convergence of the learning algorithm. Experimental results are provided to showcase the performance of the proposed method.

CVJan 29, 2020
Evaluating Robustness to Context-Sensitive Feature Perturbations of Different Granularities

Isaac Dunn, Laura Hanu, Hadrien Pouget et al.

We cannot guarantee that training datasets are representative of the distribution of inputs that will be encountered during deployment. So we must have confidence that our models do not over-rely on this assumption. To this end, we introduce a new method that identifies context-sensitive feature perturbations (e.g. shape, location, texture, colour) to the inputs of image classifiers. We produce these changes by performing small adjustments to the activation values of different layers of a trained generative neural network. Perturbing at layers earlier in the generator causes changes to coarser-grained features; perturbations further on cause finer-grained changes. Unsurprisingly, we find that state-of-the-art classifiers are not robust to any such changes. More surprisingly, when it comes to coarse-grained feature changes, we find that adversarial training against pixel-space perturbations is not just unhelpful: it is counterproductive.

LOJan 25, 2020
CounterExample Guided Neural Synthesis

Elizabeth Polgreen, Ralph Abboud, Daniel Kroening

Program synthesis is the generation of a program from a specification. Correct synthesis is difficult, and methods that provide formal guarantees suffer from scalability issues. On the other hand, neural networks are able to generate programs from examples quickly but are unable to guarantee that the program they output actually meets the logical specification. In this work we combine neural networks with formal reasoning: using the latter to convert a logical specification into a sequence of examples that guides the neural network towards a correct solution, and to guarantee that any solution returned satisfies the formal specification. We apply our technique to synthesising loop invariants and compare the performance to existing solvers that use SMT and existing techniques that use neural networks. Our results show that the formal reasoning based guidance improves the performance of the neural network substantially, nearly doubling the number of benchmarks it can solve.

FLJan 15, 2020
Learning Concise Models from Long Execution Traces

Natasha Yogananda Jeppu, Tom Melham, Daniel Kroening et al.

Abstract models of system-level behaviour have applications in design exploration, analysis, testing and verification. We describe a new algorithm for automatically extracting useful models, as automata, from execution traces of a HW/SW system driven by software exercising a use-case of interest. Our algorithm leverages modern program synthesis techniques to generate predicates on automaton edges, succinctly describing system behaviour. It employs trace segmentation to tackle complexity for long traces. We learn concise models capturing transaction-level, system-wide behaviour--experimentally demonstrating the approach using traces from a variety of sources, including the x86 QEMU virtual platform and the Real-Time Linux kernel.

LGNov 22, 2019
DeepSynth: Automata Synthesis for Automatic Task Segmentation in Deep Reinforcement Learning

Mohammadhosein Hasanbeig, Natasha Yogananda Jeppu, Alessandro Abate et al.

This paper proposes DeepSynth, a method for effective training of deep Reinforcement Learning (RL) agents when the reward is sparse and non-Markovian, but at the same time progress towards the reward requires achieving an unknown sequence of high-level objectives. Our method employs a novel algorithm for synthesis of compact automata to uncover this sequential structure automatically. We synthesise a human-interpretable automaton from trace data collected by exploring the environment. The state space of the environment is then enriched with the synthesised automaton so that the generation of a control policy by deep RL is guided by the discovered structure encoded in the automaton. The proposed approach is able to cope with both high-dimensional, low-level features and unknown sparse non-Markovian rewards. We have evaluated DeepSynth's performance in a set of experiments that includes the Atari game Montezuma's Revenge. Compared to existing approaches, we obtain a reduction of two orders of magnitude in the number of iterations required for policy synthesis, and also a significant improvement in scalability.

LGSep 23, 2019
Modular Deep Reinforcement Learning with Temporal Logic Specifications

Lim Zun Yuan, Mohammadhosein Hasanbeig, Alessandro Abate et al.

We propose an actor-critic, model-free, and online Reinforcement Learning (RL) framework for continuous-state continuous-action Markov Decision Processes (MDPs) when the reward is highly sparse but encompasses a high-level temporal structure. We represent this temporal structure by a finite-state machine and construct an on-the-fly synchronised product with the MDP and the finite machine. The temporal structure acts as a guide for the RL agent within the product, where a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy. We evaluate our framework in a Mars rover experiment and we present the success rate of the synthesised policy.

LOSep 11, 2019
Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees

Mohammadhosein Hasanbeig, Yiannis Kantaros, Alessandro Abate et al.

Reinforcement Learning (RL) has emerged as an efficient method of choice for solving complex sequential decision making problems in automatic control, computer science, economics, and biology. In this paper we present a model-free RL algorithm to synthesize control policies that maximize the probability of satisfying high-level control objectives given as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in the workspace properties, the structure of the workspace, and the agent actions, giving rise to a Probabilistically-Labeled Markov Decision Process (PL-MDP) with unknown graph structure and stochastic behaviour, which is even more general case than a fully unknown MDP. We first translate the LTL specification into a Limit Deterministic Buchi Automaton (LDBA), which is then used in an on-the-fly product with the PL-MDP. Thereafter, we define a synchronous reward function based on the acceptance condition of the LDBA. Finally, we show that the RL algorithm delivers a policy that maximizes the satisfaction probability asymptotically. We provide experimental results that showcase the efficiency of the proposed method.

LGAug 6, 2019
Explaining Image Classifiers using Statistical Fault Localization

Youcheng Sun, Hana Chockler, Xiaowei Huang et al.

The black-box nature of deep neural networks (DNNs) makes it impossible to understand why a particular output is produced, creating demand for "Explainable AI". In this paper, we show that statistical fault localization (SFL) techniques from software engineering deliver high quality explanations of the outputs of DNNs, where we define an explanation as a minimal subset of features sufficient for making the same decision as for the original input. We present an algorithm and a tool called DeepCover, which synthesizes a ranking of the features of the inputs using SFL and constructs explanations for the decisions of the DNN based on this ranking. We compare explanations produced by DeepCover with those of the state-of-the-art tools GradCAM, LIME, SHAP, RISE and Extremal and show that explanations generated by DeepCover are consistently better across a broad set of experiments. On a benchmark set with known ground truth, DeepCover achieves 76.7% accuracy, which is 6% better than the second best Extremal.

LGMay 7, 2019
Adaptive Generation of Unrestricted Adversarial Inputs

Isaac Dunn, Hadrien Pouget, Tom Melham et al.

Neural networks are vulnerable to adversarially-constructed perturbations of their inputs. Most research so far has considered perturbations of a fixed magnitude under some $l_p$ norm. Although studying these attacks is valuable, there has been increasing interest in the construction of (and robustness to) unrestricted attacks, which are not constrained to a small and rather artificial subset of all possible adversarial inputs. We introduce a novel algorithm for generating such unrestricted adversarial inputs which, unlike prior work, is adaptive: it is able to tune its attacks to the classifier being targeted. It also offers a 400-2,000x speedup over the existing state of the art. We demonstrate our approach by generating unrestricted adversarial inputs that fool classifiers robust to perturbation-based attacks. We also show that, by virtue of being adaptive and unrestricted, our attack is able to defeat adversarial training against it.

LGFeb 2, 2019
Certified Reinforcement Learning with Logic Guidance

Hosein Hasanbeig, Daniel Kroening, Alessandro Abate

Reinforcement Learning (RL) is a widely employed machine learning architecture that has been applied to a variety of control problems. However, applications in safety-critical domains require a systematic and formal approach to specifying requirements as tasks or goals. We propose a model-free RL algorithm that enables the use of Linear Temporal Logic (LTL) to formulate a goal for unknown continuous-state/action Markov Decision Processes (MDPs). The given LTL property is translated into a Limit-Deterministic Generalised Buchi Automaton (LDGBA), which is then used to shape a synchronous reward function on-the-fly. Under certain assumptions, the algorithm is guaranteed to synthesise a control policy whose traces satisfy the LTL specification with maximal probability.

LGDec 18, 2018
A Survey of Safety and Trustworthiness of Deep Neural Networks: Verification, Testing, Adversarial Attack and Defence, and Interpretability

Xiaowei Huang, Daniel Kroening, Wenjie Ruan et al.

In the past few years, significant progress has been made on deep neural networks (DNNs) in achieving human-level performance on several long-standing tasks. With the broader deployment of DNNs on various applications, the concerns over their safety and trustworthiness have been raised in public, especially after the widely reported fatal incidents involving self-driving cars. Research to address these concerns is particularly active, with a significant number of papers released in the past few years. This survey paper conducts a review of the current research effort into making DNNs safe and trustworthy, by focusing on four aspects: verification, testing, adversarial attack and defence, and interpretability. In total, we survey 202 papers, most of which were published after 2017.

LGSep 20, 2018
Logically-Constrained Neural Fitted Q-Iteration

Mohammadhosein Hasanbeig, Alessandro Abate, Daniel Kroening

We propose a method for efficient training of Q-functions for continuous-state Markov Decision Processes (MDPs) such that the traces of the resulting policies satisfy a given Linear Temporal Logic (LTL) property. LTL, a modal logic, can express a wide range of time-dependent logical properties (including "safety") that are quite similar to patterns in natural language. We convert the LTL property into a limit deterministic Buchi automaton and construct an on-the-fly synchronised product MDP. The control policy is then synthesised by defining an adaptive reward function and by applying a modified neural fitted Q-iteration algorithm to the synchronised structure, assuming that no prior knowledge is available from the original MDP. The proposed method is evaluated in a numerical study to test the quality of the generated control policy and is compared with conventional methods for policy synthesis such as MDP abstraction (Voronoi quantizer) and approximate dynamic programming (fitted value iteration).

LOSep 11, 2018
Benchmarking of Java Verification Tools at the Software Verification Competition (SV-COMP)

Lucas Cordeiro, Daniel Kroening, Peter Schrammel

Empirical evaluation of verification tools by benchmarking is a common method in software verification research. The Competition on Software Verification (SV-COMP) aims at standardization and reproducibility of benchmarking within the software verification community on an annual basis, through comparative evaluation of fully automatic software verifiers for C programs. Building upon this success, here we describe how to re-use the ecosystem developed around SV-COMP for benchmarking Java verification tools. We provide a detailed description of the rules for benchmark verification tasks, the integration of new tools into SV-COMP's benchmarking framework and also give experimental results of a benchmarking run on state-of-the-art Java verification tools, JPF, SPF, JayHorn and JBMC.

SEMay 9, 2018
Evaluating Manual Intervention to Address the Challenges of Bug Finding with KLEE

John Galea, Sean Heelan, Daniel Neville et al.

Symbolic execution has shown its ability to find security-relevant flaws in software, but faces significant scalability challenges. There is a commonly held belief that manual intervention by an expert can help alleviate these limiting factors. However, there has been little formal investigation of this idea. In this paper, we present our experiences applying the KLEE symbolic execution engine to a new bug corpus, and of using manual intervention to alleviate the issues encountered. Our contributions are (1) Hemiptera, a novel corpus of over 130 bugs in real world software, (2) a comprehensive evaluation of the KLEE symbolic execution engine on Hemiptera with a categorisation of frequently occurring software patterns that are problematic for symbolic execution, and (3) an evaluation of manual mitigations aimed at addressing the underlying issues of symbolic execution. Our experience shows that manual intervention can increase both code coverage and bug detection in many situations. It is not a silver bullet however, and we discuss its limitations and the challenges encountered.

LGApr 30, 2018
Concolic Testing for Deep Neural Networks

Youcheng Sun, Min Wu, Wenjie Ruan et al.

Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. This paper presents the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we formalise coverage criteria for DNNs that have been studied in the literature, and then develop a coherent method for performing concolic testing to increase test coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples.

CRApr 23, 2018
Automatic Heap Layout Manipulation for Exploitation

Sean Heelan, Tom Melham, Daniel Kroening

Heap layout manipulation is integral to exploiting heap-based memory corruption vulnerabilities. In this paper we present the first automatic approach to the problem, based on pseudo-random black-box search. Our approach searches for the inputs required to place the source of a heap-based buffer overflow or underflow next to heap-allocated objects that an exploit developer, or automatic exploit generation system, wishes to read or corrupt. We present a framework for benchmarking heap layout manipulation algorithms, and use it to evaluate our approach on several real-world allocators, showing that pseudo-random black box search can be highly effective. We then present SHRIKE, a novel system that can perform automatic heap layout manipulation on the PHP interpreter and can be used in the construction of control-flow hijacking exploits. Starting from PHP's regression tests, SHRIKE discovers fragments of PHP code that interact with the interpreter's heap in useful ways, such as making allocations and deallocations of particular sizes, or allocating objects containing sensitive data, such as pointers. SHRIKE then uses our search algorithm to piece together these fragments into programs, searching for one that achieves a desired heap layout. SHRIKE allows an exploit developer to focus on the higher level concepts in an exploit, and to defer the resolution of heap layout constraints to SHRIKE. We demonstrate this by using SHRIKE in the construction of a control-flow hijacking exploit for the PHP interpreter.

LGApr 16, 2018
Global Robustness Evaluation of Deep Neural Networks with Provable Guarantees for the $L_0$ Norm

Wenjie Ruan, Min Wu, Youcheng Sun et al.

Deployment of deep neural networks (DNNs) in safety- or security-critical systems requires provable guarantees on their correct behaviour. A common requirement is robustness to adversarial perturbations in a neighbourhood around an input. In this paper we focus on the $L_0$ norm and aim to compute, for a trained DNN and an input, the maximal radius of a safe norm ball around the input within which there are no adversarial examples. Then we define global robustness as an expectation of the maximal safe radius over a test data set. We first show that the problem is NP-hard, and then propose an approximate approach to iteratively compute lower and upper bounds on the network's robustness. The approach is \emph{anytime}, i.e., it returns intermediate bounds and robustness estimates that are gradually, but strictly, improved as the computation proceeds; \emph{tensor-based}, i.e., the computation is conducted over a set of inputs simultaneously, instead of one by one, to enable efficient GPU computation; and has \emph{provable guarantees}, i.e., both the bounds and the robustness estimates can converge to their optimal values. Finally, we demonstrate the utility of the proposed approach in practice to compute tight bounds by applying and adapting the anytime algorithm to a set of challenging problems, including global robustness evaluation, competitive $L_0$ attacks, test case generation for DNNs, and local robustness evaluation on large-scale ImageNet DNNs. We release the code of all case studies via GitHub.

LGJan 24, 2018
Logically-Constrained Reinforcement Learning

Mohammadhosein Hasanbeig, Alessandro Abate, Daniel Kroening

We present the first model-free Reinforcement Learning (RL) algorithm to synthesise policies for an unknown Markov Decision Process (MDP), such that a linear time property is satisfied. The given temporal property is converted into a Limit Deterministic Buchi Automaton (LDBA) and a robust reward function is defined over the state-action pairs of the MDP according to the resulting LDBA. With this reward function, the policy synthesis procedure is "constrained" by the given specification. These constraints guide the MDP exploration so as to minimize the solution time by only considering the portion of the MDP that is relevant to satisfaction of the LTL property. This improves performance and scalability of the proposed method by avoiding an exhaustive update over the whole state space while the efficiency of standard methods such as dynamic programming is hindered by excessive memory requirements, caused by the need to store a full-model in memory. Additionally, we show that the RL procedure sets up a local value iteration method to efficiently calculate the maximum probability of satisfying the given property, at any given state of the MDP. We prove that our algorithm is guaranteed to find a policy whose traces probabilistically satisfy the LTL property if such a policy exists, and additionally we show that our method produces reasonable control policies even when the LTL property cannot be satisfied. The performance of the algorithm is evaluated via a set of numerical examples. We observe an improvement of one order of magnitude in the number of iterations required for the synthesis compared to existing approaches.

SEJul 5, 2017
Functional Requirements-Based Automated Testing for Avionics

Youcheng Sun, Martin Brain, Daniel Kroening et al.

We propose and demonstrate a method for the reduction of testing effort in safety-critical software development using DO-178 guidance. We achieve this through the application of Bounded Model Checking (BMC) to formal low-level requirements, in order to generate tests automatically that are good enough to replace existing labor-intensive test writing procedures while maintaining independence from implementation artefacts. Given that existing manual processes are often empirical and subjective, we begin by formally defining a metric, which extends recognized best practice from code coverage analysis strategies to generate tests that adequately cover the requirements. We then formulate the automated test generation procedure and apply its prototype in case studies with industrial partners. In review, the method developed here is demonstrated to significantly reduce the human effort for the qualification of software products under DO-178 guidance.

SYAug 22, 2017
Unbounded-Time Analysis of Guarded LTI Systems with Inputs by Abstract Acceleration (extended version)

Dario Cattaruzza, Alessandro Abate, Peter Schrammel et al.

Linear Time Invariant (LTI) systems are ubiquitous in control applications. Unbounded-time reachability analysis that can cope with industrial-scale models with thousands of variables is needed. To tackle this problem, we use abstract acceleration, a method for unbounded-time polyhedral reachability analysis for linear systems. Existing variants of the method are restricted to closed systems, i.e., dynamical models without inputs or non-determinism. In this paper, we present an extension of abstract acceleration to linear loops with inputs, which correspond to discrete-time LTI control systems under guard conditions. The new method relies on a relaxation of the solution of the linear dynamical equation that leads to a precise over-approximation of the set of reachable states, which are evaluated using support functions. In order to increase scalability, we use floating-point computations and ensure soundness by interval arithmetic. Our experiments show that performance increases by several orders of magnitude over alternative approaches in the literature. In turn, this tremendous gain allows us to improve on precision by computing more expensive abstractions. We outperform state-of-the-art tools for unbounded-time analysis of LTI system with inputs in speed as well as in precision.

SYJun 9, 2017
DSValidator: An Automated Counterexample Reproducibility Tool for Digital Systems (Tool Demonstration)

Lennon Chaves, Iury Bessa, Lucas Cordeiro et al.

An automated counterexample reproducibility tool based on MATLAB is presented, called DSValidator, with the goal of reproducing counterexamples that refute specific properties related to digital systems. We exploit counterexamples generated by the Digital System Verifier (DSVerifier), which is a model checking tool based on satisfiability modulo theories for digital systems. DSValidator reproduces the execution of a digital system, relating its input with the counterexample, in order to establish trust in a verification result. We show that DSValidator can validate a set of intricate counterexamples for digital controllers used in a real quadrotor attitude system within seconds and also expose incorrect verification results in DSVerifier. The resulting toolbox leverages the potential of combining different verification tools for validating digital systems via an exchangeable counterexample format.

LOOct 10, 2016
Verification of the Tree-Based Hierarchical Read-Copy Update in the Linux Kernel

Lihao Liang, Paul E. McKenney, Daniel Kroening et al.

Read-Copy Update (RCU) is a scalable, high-performance Linux-kernel synchronization mechanism that runs low-overhead readers concurrently with updaters. Production-quality RCU implementations for multi-core systems are decidedly non-trivial. Giving the ubiquity of Linux, a rare "million-year" bug can occur several times per day across the installed base. Stringent validation of RCU's complex behaviors is thus critically important. Exhaustive testing is infeasible due to the exponential number of possible executions, which suggests use of formal verification. Previous verification efforts on RCU either focus on simple implementations or use modeling languages, the latter requiring error-prone manual translation that must be repeated frequently due to regular changes in the Linux kernel's RCU implementation. In this paper, we first describe the implementation of Tree RCU in the Linux kernel. We then discuss how to construct a model directly from Tree RCU's source code in C, and use the CBMC model checker to verify its safety and liveness properties. To our best knowledge, this is the first verification of a significant part of RCU's source code, and is an important step towards integration of formal verification into the Linux kernel's regression test suite.

SEFeb 26, 2016
The Virtues of Conflict: Analyzing Modern Concurrency

Ganesh Narayanaswamy, Saurabh Joshi, Daniel Kroening

Modern shared memory multiprocessors permit reordering of memory operations for performance reasons. These reorderings are often a source of subtle bugs in programs written for such architectures. Traditional approaches to verify weak memory programs often rely on interleaving semantics, which is prone to state space explosion, and thus severely limits the scalability of the analysis. In recent times, there has been a renewed interest in modelling dynamic executions of weak memory programs using partial orders. However, such an approach typically requires ad-hoc mechanisms to correctly capture the data and control-flow choices/conflicts present in real-world programs. In this work, we propose a novel, conflict-aware, composable, truly concurrent semantics for programs written using C/C++ for modern weak memory architectures. We exploit our symbolic semantics based on general event structures to build an efficient decision procedure that detects assertion violations in bounded multi-threaded programs. Using a large, representative set of benchmarks, we show that our conflict-aware semantics outperforms the state-of-the-art partial-order based approaches.

SESep 15, 2015
Assisted Coverage Closure

Adam Nellis, Pascal Kesseli, Philippa Ryan Conmy et al.

The malfunction of safety-critical systems may cause damage to people and the environment. Software within those systems is rigorously designed and verified according to domain specific guidance, such as ISO26262 for automotive safety. This paper describes academic and industrial co-operation in tool development to support one of the most stringent of the requirements --- achieving full code coverage in requirements-driven testing. We present a verification workflow supported by a tool that integrates the coverage measurement tool RapiCover with the test-vector generator FShell. The tool assists closing the coverage gap by providing the engineer with test vectors that help in debugging coverage-related code quality issues and creating new test cases, as well as justifying the presence of unreachable parts of the code in order to finally achieve full effective coverage according to the required criteria. To illustrate the practical utility of the tool, we report about an application of the tool to a case study from automotive industry.

LOJun 18, 2015
Safety Verification and Refutation by k-invariants and k-induction (extended version)

Martin Brain, Saurabh Joshi, Daniel Kroening et al.

Most software verification tools can be classified into one of a number of established families, each of which has their own focus and strengths. For example, concrete counterexample generation in model checking, invariant inference in abstract interpretation and completeness via annotation for deductive verification. This creates a significant and fundamental usability problem as users may have to learn and use one technique to find potential problems but then need an entirely different one to show that they have been fixed. This paper presents a single, unified algorithm kIkI, which strictly generalises abstract interpretation, bounded model checking and k-induction. This not only combines the strengths of these techniques but allows them to interact and reinforce each other, giving a `single-tool' approach to verification.

SEMay 18, 2015
Synthesising Interprocedural Bit-Precise Termination Proofs (extended version)

Hong-Yi Chen, Cristina David, Daniel Kroening et al.

Proving program termination is key to guaranteeing absence of undesirable behaviour, such as hanging programs and even security vulnerabilities such as denial-of-service attacks. To make termination checks scale to large systems, interprocedural termination analysis seems essential, which is a largely unexplored area of research in termination analysis, where most effort has focussed on difficult single-procedure problems. We present a modular termination analysis for C programs using template-based interprocedural summarisation. Our analysis combines a context-sensitive, over-approximating forward analysis with the inference of under-approximating preconditions for termination. Bit-precise termination arguments are synthesised over lexicographic linear ranking function templates. Our experimental results show that our tool 2LS outperforms state-of-the-art alternatives, and demonstrate the clear advantage of interprocedural reasoning over monolithic analysis in terms of efficiency, while retaining comparable precision.