SEDec 4, 2016Code
Implementing and Evaluating Candidate-Based Invariant GenerationAdam Betts, Nathan Chong, Pantazis Deligiannis et al.
The discovery of inductive invariants lies at the heart of static program verification. Presently, many automatic solutions to inductive invariant generation are inflexible, only applicable to certain classes of programs, or unpredictable. An automatic technique that circumvents these deficiencies to some extent is candidate-based invariant generation. This paper describes our efforts to apply candidate-based invariant generation in GPUVerify, a static checker for programs that run on GPUs. We study a set of GPU programs that contain loops, drawn from a number of open source suites and vendor SDKs. We describe the methodology we used to incrementally improve the invariant generation capabilities of GPUVerify to handle these benchmarks, through candidate-based invariant generation, using cheap static analysis to speculate potential program invariants. We also describe a set of experiments that we used to examine the effectiveness of our rules for candidate generation, assessing rules based on their generality (the extent to which they generate candidate invariants), hit rate (the extent to which the generated candidates hold), worth (the extent to which provable candidates actually help in allowing verification to succeed), and influence (the extent to which the success of one generation rule depends on candidates generated by another rule). The candidates produced by GPUVerify help to verify 231 of the 253 programs. This increase in precision, however, makes GPUVerify sluggish: the more candidates that are generated, the more time is spent determining which are inductive invariants. To speed up this process, we have investigated four under-approximating program analyses that aim to reject false candidates quickly and a framework whereby these analyses can run in sequence or in parallel.
SEFeb 5
Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving TransformationsShahin Honarvar, Amber Gorzynski, James Lee-Jones et al.
Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing pointwise benchmarks have limited ability to shed light on the robustness and generalisation abilities of agents across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used as the basis for generating a family of semantically-equivalent challenges via semantics-preserving program transformations. This enables controlled evaluation of agent robustness to source code transformations while keeping the underlying exploit strategy fixed. We introduce a new tool, Evolve-CTF, that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to intrusive renaming and code insertion-based transformations, but that composed transformations and deeper obfuscation affect performance by requiring more sophisticated use of tools. We also find that enabling explicit reasoning has little effect on solution success rates across challenge families. Our work contributes a valuable technique and tool for future LLM evaluations, and a large dataset characterising the capabilities of current state-of-the-art models in this domain.
HCJul 7, 2020
A report on the first virtual PLDI conferenceAlastair F. Donaldson
This is a report on the PLDI 2020 conference, for which I was General Chair, which was held virtually for the first time as a result of the COVID-19 pandemic. The report contains: my personal reflections on the positive and negative aspects of the event; a description of the format of the event and associated logistical details; and data (with some analysis) on attendees' views on the conference format, the extent to which attendees engaged with the conference, attendees' views on virtual vs. physical conferences (with a focus on PLDI specifically) and the diversity of conference registrants. I hope that the report will be a useful resource for organizers of upcoming virtual conferences, and generally interesting to the Programming Languages community and beyond.
SEFeb 25, 2019
A Systematic Impact Study for Fuzzer-Found Compiler BugsMichaël Marcozzi, Qiyi Tang, Alastair F. Donaldson et al.
Despite much recent interest in compiler randomized testing (fuzzing), the practical impact of fuzzer-found compiler bugs on real-world applications has barely been assessed. We present the first quantitative and qualitative study of the tangible impact of miscompilation bugs in a mature compiler. We follow a rigorous methodology where the bug impact over the compiled application is evaluated based on (1) whether the bug appears to trigger during compilation; (2) the extent to which generated assembly code changes syntactically due to triggering of the bug; and (3) how much such changes do cause regression test suite failures and could be used to manually trigger divergences during execution. The study is conducted with respect to the compilation of more than 10 million lines of C/C++ code from 309 Debian packages, using 12% of the historical and now fixed miscompilation bugs found by four state-of-the-art fuzzers in the Clang/LLVM compiler, as well as 18 bugs found by human users compiling real code or by formal verification. The results show that almost half of the fuzzer-found bugs propagate to the generated binaries for some packages, but rarely affect their syntax and cause two failures in total when running their test suites. User-reported and formal verification bugs do not exhibit a higher impact, with less frequently triggered bugs and one test failure. Our manual analysis of a selection of bugs, either fuzzer-found or not, suggests that none can easily trigger a runtime divergence on the packages considered in the analysis, and that in general they affect only corner cases.
SENov 9, 2015
Integrating a large-scale testing campaign in the CK frameworkAndrei Lascu, Alastair F. Donaldson
We consider the problem of conducting large experimental campaigns in programming languages research. Most research efforts require a certain level of bookkeeping of results. This is manageable via quick, on-the-fly infrastructure implementations. However, it becomes a problem for large-scale testing initiatives, especially as the needs of the project evolve along the way. We look at how the Collective Knowledge generalized testing framework can help with such a project and its overall applicability and ease of use. The project in question is an OpenCL compiler testing campaign. We investigate how to use the Collective Knowledge framework to lead the experimental campaign, by providing storage and representation of test cases and their results. We also provide an initial implementation, publicly available.