38.9SEApr 28
RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL RequirementsLeon Kogler, Stefan Hangler, Maximilian Ehrhart et al.
Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. . Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.
SENov 1, 2020
How Testable is Business Software?Peter Schrammel
Most businesses rely on a significant stack of software to perform their daily operations. This software is business-critical as defects in this software have major impacts on revenue and customer satisfaction. The primary means for verification of this software is testing. We conducted an extensive analysis of Java software packages to evaluate their unit-testability. The results show that code in software repositories is typically split into portions of very trivial code, non-trivial code that is unit-testable, and code that cannot be unit-tested easily. This brings up interesting considerations regarding the use of test coverage metrics and design for testability, which is crucial for testing efficiency and effectiveness. Lack of unit-testability is an obstacle to applying tools that perform automated verification and test generation. These tools cannot make up for poor testability of the code and have a hard time in succeeding or are not even applicable without first improving the design of the software system.
SEMar 2, 2019
2LS: Heap Analysis and Memory Safety (Competition Contribution)Viktor Malik, Martin Hruska, Peter Schrammel et al.
2LS is a framework for analysis of sequential C programs that can verify and refute program assertions and termination. The 2LS framework is built upon the CPROVER infrastructure and implements template-based synthesis techniques, e.g. to find invariants and ranking functions, and incremental loop unwinding techniques to find counterexamples and k-induction proofs. The main improvements in this year's version are the ability of 2LS to analyse programs requiring combined reasoning about shape and content of dynamic data structures, and an instrumentation for memory safety properties.
LOSep 11, 2018
Benchmarking of Java Verification Tools at the Software Verification Competition (SV-COMP)Lucas Cordeiro, Daniel Kroening, Peter Schrammel
Empirical evaluation of verification tools by benchmarking is a common method in software verification research. The Competition on Software Verification (SV-COMP) aims at standardization and reproducibility of benchmarking within the software verification community on an annual basis, through comparative evaluation of fully automatic software verifiers for C programs. Building upon this success, here we describe how to re-use the ecosystem developed around SV-COMP for benchmarking Java verification tools. We provide a detailed description of the rules for benchmark verification tasks, the integration of new tools into SV-COMP's benchmarking framework and also give experimental results of a benchmarking run on state-of-the-art Java verification tools, JPF, SPF, JayHorn and JBMC.
SYAug 22, 2017
Unbounded-Time Analysis of Guarded LTI Systems with Inputs by Abstract Acceleration (extended version)Dario Cattaruzza, Alessandro Abate, Peter Schrammel et al.
Linear Time Invariant (LTI) systems are ubiquitous in control applications. Unbounded-time reachability analysis that can cope with industrial-scale models with thousands of variables is needed. To tackle this problem, we use abstract acceleration, a method for unbounded-time polyhedral reachability analysis for linear systems. Existing variants of the method are restricted to closed systems, i.e., dynamical models without inputs or non-determinism. In this paper, we present an extension of abstract acceleration to linear loops with inputs, which correspond to discrete-time LTI control systems under guard conditions. The new method relies on a relaxation of the solution of the linear dynamical equation that leads to a precise over-approximation of the set of reachable states, which are evaluated using support functions. In order to increase scalability, we use floating-point computations and ensure soundness by interval arithmetic. Our experiments show that performance increases by several orders of magnitude over alternative approaches in the literature. In turn, this tremendous gain allows us to improve on precision by computing more expensive abstractions. We outperform state-of-the-art tools for unbounded-time analysis of LTI system with inputs in speed as well as in precision.
LOJul 15, 2016
Challenges in Decomposing Encodings of Verification ProblemsPeter Schrammel
Modern program verifiers use logic-based encodings of the verification problem that are discharged by a back end reasoning engine. However, instances of such encodings for large programs can quickly overwhelm these back end solvers. Hence, we need techniques to make the solving process scale to large systems, such as partitioning (divide-and-conquer) and abstraction. In recent work, we showed how decomposing the formula encoding of a termination analysis can significantly increase efficiency. The analysis generates a sequence of logical formulas with existentially quantified predicates that are solved by a synthesis-based program analysis engine. However, decomposition introduces abstractions in addition to those required for finding the unknown predicates in the formula, and can hence deteriorate precision. We discuss the challenges associated with such decompositions and their interdependencies with the solving process.
SESep 15, 2015
Assisted Coverage ClosureAdam Nellis, Pascal Kesseli, Philippa Ryan Conmy et al.
The malfunction of safety-critical systems may cause damage to people and the environment. Software within those systems is rigorously designed and verified according to domain specific guidance, such as ISO26262 for automotive safety. This paper describes academic and industrial co-operation in tool development to support one of the most stringent of the requirements --- achieving full code coverage in requirements-driven testing. We present a verification workflow supported by a tool that integrates the coverage measurement tool RapiCover with the test-vector generator FShell. The tool assists closing the coverage gap by providing the engineer with test vectors that help in debugging coverage-related code quality issues and creating new test cases, as well as justifying the presence of unreachable parts of the code in order to finally achieve full effective coverage according to the required criteria. To illustrate the practical utility of the tool, we report about an application of the tool to a case study from automotive industry.
LOJun 18, 2015
Safety Verification and Refutation by k-invariants and k-induction (extended version)Martin Brain, Saurabh Joshi, Daniel Kroening et al.
Most software verification tools can be classified into one of a number of established families, each of which has their own focus and strengths. For example, concrete counterexample generation in model checking, invariant inference in abstract interpretation and completeness via annotation for deductive verification. This creates a significant and fundamental usability problem as users may have to learn and use one technique to find potential problems but then need an entirely different one to show that they have been fixed. This paper presents a single, unified algorithm kIkI, which strictly generalises abstract interpretation, bounded model checking and k-induction. This not only combines the strengths of these techniques but allows them to interact and reinforce each other, giving a `single-tool' approach to verification.
SEMay 18, 2015
Synthesising Interprocedural Bit-Precise Termination Proofs (extended version)Hong-Yi Chen, Cristina David, Daniel Kroening et al.
Proving program termination is key to guaranteeing absence of undesirable behaviour, such as hanging programs and even security vulnerabilities such as denial-of-service attacks. To make termination checks scale to large systems, interprocedural termination analysis seems essential, which is a largely unexplored area of research in termination analysis, where most effort has focussed on difficult single-procedure problems. We present a modular termination analysis for C programs using template-based interprocedural summarisation. Our analysis combines a context-sensitive, over-approximating forward analysis with the inference of under-approximating preconditions for termination. Bit-precise termination arguments are synthesised over lexicographic linear ranking function templates. Our experimental results show that our tool 2LS outperforms state-of-the-art alternatives, and demonstrate the clear advantage of interprocedural reasoning over monolithic analysis in terms of efficiency, while retaining comparable precision.
SESep 20, 2014
Incremental Bounded Model Checking for Embedded Software (extended version)Peter Schrammel, Daniel Kroening, Martin Brain et al.
Program analysis is on the brink of mainstream in embedded systems development. Formal verification of behavioural requirements, finding runtime errors and automated test case generation are some of the most common applications of automated verification tools based on Bounded Model Checking. Existing industrial tools for embedded software use an off-the-shelf Bounded Model Checker and apply it iteratively to verify the program with an increasing number of unwindings. This approach unnecessarily wastes time repeating work that has already been done and fails to exploit the power of incremental SAT solving. This paper reports on the extension of the software model checker CBMC to support incremental Bounded Model Checking and its successful integration with the industrial embedded software verification tool BTC EmbeddedTester. We present an extensive evaluation over large industrial embedded programs, which shows that incremental Bounded Model Checking cuts runtimes by one order of magnitude in comparison to the standard non-incremental approach, enabling the application of formal verification to large and complex embedded software.
SEJun 14, 2013
Chaining Test Cases for Reactive System Testing (extended version)Peter Schrammel, Tom Melham, Daniel Kroening
Testing of synchronous reactive systems is challenging because long input sequences are often needed to drive them into a state at which a desired feature can be tested. This is particularly problematic in on-target testing, where a system is tested in its real-life application environment and the time required for resetting is high. This paper presents an approach to discovering a test case chain---a single software execution that covers a group of test goals and minimises overall test execution time. Our technique targets the scenario in which test goals for the requirements are given as safety properties. We give conditions for the existence and minimality of a single test case chain and minimise the number of test chains if a single test chain is infeasible. We report experimental results with a prototype tool for C code generated from Simulink models and compare it to state-of-the-art test suite generators.