Toby Murray

PL
h-index12
15papers
150citations
Novelty53%
AI Score48

15 Papers

CRDec 16, 2025Code
UIXPOSE: Mobile Malware Detection via Intention-Behaviour Discrepancy Analysis

Amirmohammad Pasdar, Toby Murray, Van-Thuan Pham

We introduce UIXPOSE, a source-code-agnostic framework that operates on both compiled and open-source apps. This framework applies Intention Behaviour Alignment (IBA) to mobile malware analysis, aligning UI-inferred intent with runtime semantics. Previous work either infers intent statically, e.g., permission-centric, or widget-level or monitors coarse dynamic signals (endpoints, partial resource usage) that miss content and context. UIXPOSE infers an intent vector from each screen using vision-language models and knowledge structures and combines decoded network payloads, heap/memory signals, and resource utilisation traces into a behaviour vector. Their alignment, calculated at runtime, can both detect misbehaviour and highlight exploration of behaviourally rich paths. In three real-world case studies, UIXPOSE reveals covert exfiltration and hidden background activity that evade metadata-only baselines, demonstrating how IBA improves dynamic detection.

PLFeb 22, 2025
Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

Thanh Le-Cong, Bach Le, Toby Murray

Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.

PLMar 28
Bit-Vector CHC Solving for Binary Analysis and Binary Analysis for Bit-Vector CHC Solving

Aaron Bembenek, Toby Murray

For high-assurance software, source-level reasoning is insufficient: we need binary-level guarantees. Despite constrained Horn clause (CHC) solving being one of the most popular forms of automated verification, prior work has not evaluated the viability of CHC solving for binary analysis. To fill this gap, we assemble a pipeline that encodes binary analysis problems as CHCs in the SMT logic of quantifier-free bit vectors, and show that off-the-shelf CHC solvers achieve reasonable success on binaries compiled from 983 C invariant inference benchmarks: a portfolio solves 59.5% and 66.1% of the problems derived from the unoptimized and optimized binaries, respectively -- roughly equal to the success rate of a leading C verifier on the source code (60.1%). Moreover, we show that binary analysis provides a valuable source of bit-vector CHC benchmarks (which are in short supply): binary-derived problems differ from existing benchmarks both structurally and in solver success rates and rankings. Augmenting CHC solving competitions with binary-derived benchmarks will encourage solver developers to improve bit-vector reasoning, in turn making CHC solving a more effective tool for binary analysis.

LGMar 6
Lipschitz-Based Robustness Certification Under Floating-Point Execution

Toby Murray

Sensitivity-based robustness certification has emerged as a practical approach for certifying neural network robustness, including in settings that require verifiable guarantees. A key advantage of these methods is that certification is performed by concrete numerical computation (rather than symbolic reasoning) and scales efficiently with network size. However, as with the vast majority of prior work on robustness certification and verification, the soundness of these methods is typically proved with respect to a semantic model that assumes exact real arithmetic. In reality deployed neural network implementations execute using floating-point arithmetic. This mismatch creates a semantic gap between certified robustness properties and the behaviour of the executed system. As motivating evidence, we exhibit concrete counterexamples showing that real arithmetic robustness guarantees can fail under floating-point execution, even for previously verified certifiers, with discrepancies becoming pronounced at lower-precision formats such as float16. We then develop a formal, compositional theory relating real arithmetic Lipschitz-based sensitivity bounds to the sensitivity of floating-point execution under standard rounding-error models, specialised to feed-forward neural networks with ReLU activations. We derive sound conditions for robustness under floating-point execution, including bounds on certificate degradation and sufficient conditions for the absence of overflow. We formalize the theory and its main soundness results, and implement an executable certifier based on these principles, which we empirically evaluate to demonstrate its practicality.

SEFeb 19, 2024
Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing

Thanh Le-Cong, Dat Nguyen, Bach Le et al.

In this paper, we propose shifting the focus of robustness evaluation for Neural Program Repair (NPR) techniques toward naturally-occurring data transformations. To accomplish this, we first examine the naturalness of semantic-preserving transformations through a two-stage human study. This study includes (1) interviews with senior software developers to establish concrete criteria for evaluating the naturalness of these transformations, and (2) a survey involving 10 developers to assess the naturalness of 1,178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings show that only 60% of these transformations are deemed natural, while 20% are considered unnatural, with strong agreement among annotators. Moreover, the unnaturalness of these transformations significantly impacts both their applicability to benchmarks and the conclusions drawn from robustness testing. Next, we conduct natural robustness testing on NPR techniques to assess their true effectiveness against real-world data variations. Our experimental results reveal a substantial number of prediction changes in NPR techniques, leading to significant reductions in both plausible and correct patch rates when comparing performance on the original and transformed datasets. Additionally, we observe notable differences in performance improvements between NPR techniques, suggesting potential biases on NPR evaluation introduced by limited datasets. Finally, we propose an LLM-based metric to automate the assessment of transformation naturalness, ensuring the scalability of natural robustness testing.

LGFeb 6, 2024
Symbol Correctness in Deep Neural Networks Containing Symbolic Layers

Aaron Bembenek, Toby Murray

To handle AI tasks that combine perception and logical reasoning, recent work introduces Neurosymbolic Deep Neural Networks (NS-DNNs), which contain -- in addition to traditional neural layers -- symbolic layers: symbolic expressions (e.g., SAT formulas, logic programs) that are evaluated by symbolic solvers during inference. We identify and formalize an intuitive, high-level principle that can guide the design and analysis of NS-DNNs: symbol correctness, the correctness of the intermediate symbols predicted by the neural layers with respect to a (generally unknown) ground-truth symbolic representation of the input data. We demonstrate that symbol correctness is a necessary property for NS-DNN explainability and transfer learning (despite being in general impossible to train for). Moreover, we show that the framework of symbol correctness provides a precise way to reason and communicate about model behavior at neural-symbolic boundaries, and gives insight into the fundamental tradeoffs faced by NS-DNN training algorithms. In doing so, we both identify significant points of ambiguity in prior work, and provide a framework to support further NS-DNN developments.

PLMay 11, 2025
A Formally Verified Robustness Certifier for Neural Networks (Extended Version)

James Tobler, Hira Taqdees Syeda, Toby Murray

Neural networks are often susceptible to minor perturbations in input that cause them to misclassify. A recent solution to this problem is the use of globally-robust neural networks, which employ a function to certify that the classification of an input cannot be altered by such a perturbation. Outputs that pass this test are called certified robust. However, to the authors' knowledge, these certification functions have not yet been verified at the implementation level. We demonstrate how previous unverified implementations are exploitably unsound in certain circumstances. Moreover, they often rely on approximation-based algorithms, such as power iteration, that (perhaps surprisingly) do not guarantee soundness. To provide assurance that a given output is robust, we implemented and formally verified a certification function for globally-robust neural networks in Dafny. We describe the program, its specifications, and the important design decisions taken for its implementation and verification, as well as our experience applying it in practice.

SEOct 22, 2024
Memory-Efficient Large Language Models for Program Repair with Semantic-Guided Patch Generation

Thanh Le-Cong, Bach Le, Toby Murray

In this paper, we first show that increases in beam size, even for small-sized LLMs (1B-7B params), require extensive GPU usage, leading to up to 80% of recurring crashes due to memory overloads in LLM-based APR. Seemingly simple solutions to reduce memory consumption are (1) to quantize LLM models, i.e., converting the weights of an LLM from high-precision values to lower-precision ones, and (2) to make beam search sequential, i.e., forwarding each beam through the model sequentially and then concatenating them back into a single output. However, we show that these approaches still do not work via both theoretical analysis and experiments. To address this, we introduce FLAMES, a novel LLM-based APR technique that employs semantic-guided patch generation to enhance repair effectiveness and memory efficiency. Unlike conventional methods that rely on beam search, FLAMES utilizes greedy decoding to enhance memory efficiency while steering the search towards more potentially good repair candidates via a semantic-guided best-first search algorithm. At each decoding step, FLAMES uses semantic feedback from test validation, such as the number of passing and failing test cases, to select the most promising token to explore further. Our empirical evaluation on Defects4J shows thatFLAMES substantially reduces memory consumption by up to 83% compared to LLM-based APR without compromising time efficiency. Moreover, FLAMES correctly fixes 133 bugs on Defects4J, fixing 10 bugs more than the best baseline. Additionally, these improvements also generalize to the HumanEval-Java and TransformedD4J datasets, where FLAMES generates 12% and 36.5% more correct patches, respectively, than the best baseline.

SEDec 24, 2021
State Selection Algorithms and Their Impact on The Performance of Stateful Network Protocol Fuzzing

Dongge Liu, Van-Thuan Pham, Gidon Ernst et al.

The statefulness property of network protocol implementations poses a unique challenge for testing and verification techniques, including Fuzzing. Stateful fuzzers tackle this challenge by leveraging state models to partition the state space and assist the test generation process. Since not all states are equally important and fuzzing campaigns have time limits, fuzzers need effective state selection algorithms to prioritize progressive states over others. Several state selection algorithms have been proposed but they were implemented and evaluated separately on different platforms, making it hard to achieve conclusive findings. In this work, we evaluate an extensive set of state selection algorithms on the same fuzzing platform that is AFLNet, a state-of-the-art fuzzer for network servers. The algorithm set includes existing ones supported by AFLNet and our novel and principled algorithm called AFLNetLegion. The experimental results on the ProFuzzBench benchmark show that (i) the existing state selection algorithms of AFLNet achieve very similar code coverage, (ii) AFLNetLegion clearly outperforms these algorithms in selected case studies, but (iii) the overall improvement appears insignificant. These are unexpected yet interesting findings. We identify problems and share insights that could open opportunities for future research on this topic.

CRSep 8, 2021
SecRSL: Security Separation Logic for C11 Release-Acquire Concurrency (Extended version with technical appendices)

Pengbo Yan, Toby Murray

We present Security Relaxed Separation Logic (SecRSL), a separation logic for proving information-flow security of C11 programs in the Release-Acquire fragment with relaxed accesses. SecRSL is the first security logic that (1) supports weak-memory reasoning about programs in a high-level language; (2) inherits separation logic's virtues of compositional, local reasoning about (3) expressive security policies like value-dependent classification. SecRSL is also, to our knowledge, the first security logic developed over an axiomatic memory model. Thus we also present the first definitions of information-flow security for an axiomatic weak memory model, against which we prove SecRSL sound. SecRSL ensures that programs satisfy a constant-time security guarantee, while being free of undefined behaviour. We apply SecRSL to implement and verify the functional correctness and constant-time security of a range of concurrency primitives, including a spinlock module, a mixed-sensitivity mutex, and multiple synchronous channel implementations. Empirical performance evaluations of the latter demonstrate SecRSL's power to support the development of secure and performant concurrent C programs.

PLJul 12, 2021
Compositional Vulnerability Detection with Insecurity Separation Logic (Extended Version)

Toby Murray, Pengbo Yan, Gidon Ernst

Memory-safety issues and information leakage are known to be depressingly common. We consider the compositional static detection of these kinds of vulnerabilities in first-order C-like programs. Indeed the latter are relational hyper-safety violations, comparing pairs of program executions, making them more challenging to detect than the former, which require reasoning only over individual executions. Existing symbolic leakage detection methods treat only non-interactive programs, avoiding the challenges of nondeterminism. Also, being whole-program analyses they cannot be applied one-function-at-a-time, thereby ruling out incremental analysis. We remedy these shortcomings by presenting Insecurity Separation Logic (InsecSL), an under-approximate relational program logic for soundly detecting information leakage and memory-safety issues in interactive programs. Importantly, InsecSL reasons about pairs of executions, and so is relational, but purposefully resembles the non-relational Incorrectness Separation Logic (ISL) that is already automated in the Infer tool. We show how InsecSL can be automated by bi-abduction based symbolic execution, and we evaluate two implementations of this idea (one based on Infer) on various case-studies.

PLOct 27, 2020
Verified Secure Compilation for Mixed-Sensitivity Concurrent Programs

Robert Sison, Toby Murray

Proving only over source code that programs do not leak sensitive data leaves a gap between reasoning and reality that can only be filled by accounting for the behaviour of the compiler. Furthermore, software does not always have the luxury of limiting itself to single-threaded computation with resources statically dedicated to each user to ensure the confidentiality of their data. This results in mixed-sensitivity concurrent programs, which might reuse memory shared between their threads to hold data of different sensitivity levels at different times; for such programs, a compiler must preserve the value-dependent coordination of such mixed-sensitivity reuse despite the impact of concurrency. Here we demonstrate, using Isabelle/HOL, that it is feasible to verify that a compiler preserves noninterference, the strictest kind of confidentiality property, for mixed-sensitivity concurrent programs. First, we present notions of refinement that preserve a concurrent value-dependent notion of noninterference that we have designed to support such programs. As proving noninterference-preserving refinement can be considerably more complex than the standard refinements typically used to verify semantics-preserving compilation, our notions include a decomposition principle that separates the semantics-preservation from security-preservation concerns. Second, we demonstrate that these refinement notions are applicable to verified secure compilation, by exercising them on a single-pass compiler for mixed-sensitivity concurrent programs that synchronise using mutex locks, from a generic imperative language to a generic RISC-style assembly language. Finally, we execute our compiler on a nontrivial mixed-sensitivity concurrent program modelling a real-world use case, thus preserving its source-level noninterference properties down to an assembly-level model automatically. (Full abstract in paper)

SEFeb 15, 2020
Legion: Best-First Concolic Testing

Dongge Liu, Gidon Ernst, Toby Murray et al.

Concolic execution and fuzzing are two complementary coverage-based testing techniques. How to achieve the best of both remains an open challenge. To address this research problem, we propose and evaluate Legion. Legion re-engineers the Monte Carlo tree search (MCTS) framework from the AI literature to treat automated test generation as a problem of sequential decision-making under uncertainty. Its best-first search strategy provides a principled way to learn the most promising program states to investigate at each search iteration, based on observed rewards from previous iterations. Legion incorporates a form of directed fuzzing that we call approximate path-preserving fuzzing (APPFuzzing) to investigate program states selected by MCTS. APPFuzzing serves as the Monte Carlo simulation technique and is implemented by extending prior work on constrained sampling. We evaluate Legion against competitors on 2531 benchmarks from the coverage category of Test-Comp 2020, as well as measuring its sensitivity to hyperparameters, demonstrating its effectiveness on a wide variety of input programs.

LOJul 1, 2019
Verifying that a compiler preserves concurrent value-dependent information-flow security

Robert Sison, Toby Murray

It is common to prove by reasoning over source code that programs do not leak sensitive data. But doing so leaves a gap between reasoning and reality that can only be filled by accounting for the behaviour of the compiler. This task is complicated when programs enforce value-dependent information-flow security properties (in which classification of locations can vary depending on values in other locations) and complicated further when programs exploit shared-variable concurrency. Prior work has formally defined a notion of concurrency-aware refinement for preserving value-dependent security properties. However, that notion is considerably more complex than standard refinement definitions typically applied in the verification of semantics preservation by compilers. To date it remains unclear whether it can be applied to a realistic compiler, because there exist no general decomposition principles for separating it into smaller, more familiar, proof obligations. In this work, we provide such a decomposition principle, which we show can almost halve the complexity of proving secure refinement. Further, we demonstrate its applicability to secure compilation, by proving in Isabelle/HOL the preservation of value-dependent security by a proof-of-concept compiler from an imperative While language to a generic RISC-style assembly language, for programs with shared-memory concurrency mediated by locking primitives. Finally, we execute our compiler in Isabelle on a While language model of the Cross Domain Desktop Compositor, demonstrating to our knowledge the first use of a compiler verification result to carry an information-flow security property down to the assembly-level model of a non-trivial concurrent program.

CRMay 2, 2019
Empirically Analyzing Ethereum's Gas Mechanism

Renlord Yang, Toby Murray, Paul Rimba et al.

Ethereum's Gas mechanism attempts to set transaction fees in accordance with the computational cost of transaction execution: a cost borne by default by every node on the network to ensure correct smart contract execution. Gas encourages users to author transactions that are efficient to execute and in so doing encourages node diversity, allowing modestly resourced nodes to join and contribute to the security of the network. However, the effectiveness of this scheme relies on Gas costs being correctly aligned with observed computational costs in reality. In this work, we performed the first large scale empirical study to understand to what degree this alignment exists in practice, by collecting and analyzing Tera-bytes worth of nanosecond-precision transaction execution traces. Besides confirming potential denial-of-service vectors, our results also shed light on the role of I/O in transaction costs which remains poorly captured by the current Gas cost model. Finally, our results suggest that under the current Gas cost model, nodes with modest computational resources are disadvantaged compared to their better resourced peers, which we identify as an ongoing threat to node diversity and network decentralization.