Alastair Donaldson

h-index27

2papers

2 Papers

NANov 25, 2016

Certified Roundoff Error Bounds Using Semidefinite Programming

Victor Magron, George Constantinides, Alastair Donaldson

Roundoff errors cannot be avoided when implementing numerical programs with finite precision. The ability to reason about rounding is especially important if one wants to explore a range of potential representations, for instance for FPGAs or custom hardware implementations. This problem becomes challenging when the program does not employ solely linear operations, and non-linearities are inherent to many interesting computational problems in real-world applications. Existing solutions to reasoning possibly lead to either inaccurate bounds or high analysis time in the presence of nonlinear correlations between variables. Furthermore, while it is easy to implement a straightforward method such as interval arithmetic, sophisticated techniques are less straightforward to implement in a formal setting. Thus there is a need for methods which output certificates that can be formally validated inside a proof assistant. We present a framework to provide upper bounds on absolute roundoff errors of floating-point nonlinear programs. This framework is based on optimization techniques employing semidefinite programming and sums of squares certificates, which can be checked inside the Coq theorem prover to provide formal roundoff error bounds for polynomial programs. Our tool covers a wide range of nonlinear programs, including polynomials and transcendental operations as well as conditional statements. We illustrate the efficiency and precision of this tool on non-trivial programs coming from biology, optimization and space control. Our tool produces more accurate error bounds for 23% of all programs and yields better performance in 66% of all programs.

SEDec 22, 2023

Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code

Shahin Honarvar, Mark van der Wilk, Alastair Donaldson

We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated $\textit{test oracle}$ that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including $\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$ questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting $\textit{robustness}$ issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.