99.4SEMar 16Code
SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?Tingxu Han, Yi Zhang, Wei Song et al.
Agent skills, structured procedural knowledge packages injected at inference time, are increasingly used to augment LLM agents on software engineering tasks. However, their real utility in end-to-end development settings remains unclear. We present SWE-Skills-Bench, the first requirement-driven benchmark that isolates the marginal utility of agent skills in real-world software engineering (SWE). It pairs 49 public SWE skills with authentic GitHub repositories pinned at fixed commits and requirement documents with explicit acceptance criteria, yielding approximately 565 task instances across six SWE subdomains. We introduce a deterministic verification framework that maps each task's acceptance criteria to execution-based tests, enabling controlled paired evaluation with and without the skill. Our results show that skill injection benefits are far more limited than rapid adoption suggests: 39 of 49 skills yield zero pass-rate improvement, and the average gain is only +1.2%. Token overhead varies from modest savings to a 451% increase while pass rates remain unchanged. Only seven specialized skills produce meaningful gains (up to +30%), while three degrade performance (up to -10%) due to version-mismatched guidance conflicting with project context. These findings suggest that agent skills are a narrow intervention whose utility depends strongly on domain fit, abstraction level, and contextual compatibility. SWE-Skills-Bench provides a testbed for evaluating the design, selection, and deployment of skills in software engineering agents. SWE-Skills-Bench is available at https://github.com/GeniusHTX/SWE-Skills-Bench.
CRMay 25, 2022
VeriFi: Towards Verifiable Federated UnlearningXiangshan Gao, Xingjun Ma, Jingyi Wang et al.
Federated learning (FL) is a collaborative learning paradigm where participants jointly train a powerful model without sharing their private data. One desirable property for FL is the implementation of the right to be forgotten (RTBF), i.e., a leaving participant has the right to request to delete its private data from the global model. However, unlearning itself may not be enough to implement RTBF unless the unlearning effect can be independently verified, an important aspect that has been overlooked in the current literature. In this paper, we prompt the concept of verifiable federated unlearning, and propose VeriFi, a unified framework integrating federated unlearning and verification that allows systematic analysis of the unlearning and quantification of its effect, with different combinations of multiple unlearning and verification methods. In VeriFi, the leaving participant is granted the right to verify (RTV), that is, the participant notifies the server before leaving, then actively verifies the unlearning effect in the next few communication rounds. The unlearning is done at the server side immediately after receiving the leaving notification, while the verification is done locally by the leaving participant via two steps: marking (injecting carefully-designed markers to fingerprint the leaver) and checking (examining the change of the global model's performance on the markers). Based on VeriFi, we conduct the first systematic and large-scale study for verifiable federated unlearning, considering 7 unlearning methods and 5 verification methods. Particularly, we propose a more efficient and FL-friendly unlearning method, and two more effective and robust non-invasive-verification methods. We extensively evaluate VeriFi on 7 datasets and 4 types of deep learning models. Our analysis establishes important empirical understandings for more trustworthy federated unlearning.
SEAug 5, 2022
An Overview of Structural Coverage Metrics for Testing Neural NetworksMuhammad Usman, Youcheng Sun, Divya Gopinath et al.
Deep neural network (DNN) models, including those used in safety-critical domains, need to be thoroughly tested to ensure that they can reliably perform well in different scenarios. In this article, we provide an overview of structural coverage metrics for testing DNN models, including neuron coverage (NC), k-multisection neuron coverage (kMNC), top-k neuron coverage (TKNC), neuron boundary coverage (NBC), strong neuron activation coverage (SNAC) and modified condition/decision coverage (MC/DC). We evaluate the metrics on realistic DNN models used for perception tasks (including LeNet-1, LeNet-4, LeNet-5, and ResNet20) as well as on networks used in autonomy (TaxiNet). We also provide a tool, DNNCov, which can measure the testing coverage for all these metrics. DNNCov outputs an informative coverage report to enable researchers and practitioners to assess the adequacy of DNN testing, compare different coverage measures, and to more conveniently inspect the model's internals during testing.
65.9MAApr 11Code
ClawMobile: Rethinking Smartphone-Native Agentic SystemsHongchao Du, Shangyu Wu, Qiao Li et al.
Smartphones represent a uniquely challenging environment for agentic systems. Unlike cloud or desktop settings, mobile devices combine constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As large language models (LLMs) evolve from conversational assistants to action-oriented agents, achieving reliable smartphone-native autonomy requires rethinking how reasoning and control are composed. We introduce ClawMobile as a concrete exploration of this design space. ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways, improving execution stability and reproducibility on real devices. Using ClawMobile as a case study, we distill the design principles for mobile LLM runtimes and identify key challenges in efficiency, adaptability, and stability. We argue that building robust smartphone-native agentic systems demands principled coordination between probabilistic planning and deterministic system interfaces. The implementation is open-sourced~\footnote{https://github.com/ClawMobile/ClawMobile} to facilitate future exploration.
CRMay 8, 2022
VPN: Verification of Poisoning in Neural NetworksYoucheng Sun, Muhammad Usman, Divya Gopinath et al.
Neural networks are successfully used in a variety of applications, many of them having safety and security concerns. As a result researchers have proposed formal verification techniques for verifying neural network properties. While previous efforts have mainly focused on checking local robustness in neural networks, we instead study another neural network security issue, namely data poisoning. In this case an attacker inserts a trigger into a subset of the training data, in such a way that at test time, this trigger in an input causes the trained model to misclassify to some target class. We show how to formulate the check for data poisoning as a property that can be checked with off-the-shelf verification tools, such as Marabou and nneum, where counterexamples of failed checks constitute the triggers. We further show that the discovered triggers are `transferable' from a small model to a larger, better-trained model, allowing us to analyze state-of-the art performant models trained for image classification tasks.
NENov 24, 2022
AIREPAIR: A Repair Platform for Neural NetworksXidan Song, Youcheng Sun, Mustafa A. Mustafa et al.
We present AIREPAIR, a platform for repairing neural networks. It features the integration of existing network repair tools. Based on AIREPAIR, one can run different repair methods on the same model, thus enabling the fair comparison of different repair techniques. We evaluate AIREPAIR with three state-of-the-art repair tools on popular deep-learning datasets and models. Our evaluation confirms the utility of AIREPAIR, by comparing and analyzing the results from different repair techniques. A demonstration is available at https://youtu.be/UkKw5neeWhw.
LGJun 23, 2023
QNNRepair: Quantized Neural Network RepairXidan Song, Youcheng Sun, Mustafa A. Mustafa et al.
We present QNNRepair, the first method in the literature for repairing quantized neural networks (QNNs). QNNRepair aims to improve the accuracy of a neural network model after quantization. It accepts the full-precision and weight-quantized neural networks and a repair dataset of passing and failing tests. At first, QNNRepair applies a software fault localization method to identify the neurons that cause performance degradation during neural network quantization. Then, it formulates the repair problem into a linear programming problem of solving neuron weights parameters, which corrects the QNN's performance on failing tests while not compromising its performance on passing tests. We evaluate QNNRepair with widely used neural network architectures such as MobileNetV2, ResNet, and VGGNet on popular datasets, including high-resolution images. We also compare QNNRepair with the state-of-the-art data-free quantization method SQuant. According to the experiment results, we conclude that QNNRepair is effective in improving the quantized model's performance in most cases. Its repaired models have 24% higher accuracy than SQuant's in the independent validation set, especially for the ImageNet dataset.
AINov 23, 2022
Safety Analysis of Autonomous Driving Systems Based on Model LearningRenjue Li, Tianhang Qin, Pengfei Yang et al.
We present a practical verification method for safety analysis of the autonomous driving system (ADS). The main idea is to build a surrogate model that quantitatively depicts the behaviour of an ADS in the specified traffic scenario. The safety properties proved in the resulting surrogate model apply to the original ADS with a probabilistic guarantee. Furthermore, we explore the safe and the unsafe parameter space of the traffic scenario for driving hazards. We demonstrate the utility of the proposed approach by evaluating safety properties on the state-of-the-art ADS in literature, with a variety of simulated traffic scenarios.
SESep 13, 2024
FAST: Boosting Uncertainty-based Test Prioritization Methods for Neural Networks via Feature SelectionJialuo Chen, Jingyi Wang, Xiyue Zhang et al.
Due to the vast testing space, the increasing demand for effective and efficient testing of deep neural networks (DNNs) has led to the development of various DNN test case prioritization techniques. However, the fact that DNNs can deliver high-confidence predictions for incorrectly predicted examples, known as the over-confidence problem, causes these methods to fail to reveal high-confidence errors. To address this limitation, in this work, we propose FAST, a method that boosts existing prioritization methods through guided FeAture SelecTion. FAST is based on the insight that certain features may introduce noise that affects the model's output confidence, thereby contributing to high-confidence errors. It quantifies the importance of each feature for the model's correct predictions, and then dynamically prunes the information from the noisy features during inference to derive a new probability vector for the uncertainty estimation. With the help of FAST, the high-confidence errors and correctly classified examples become more distinguishable, resulting in higher APFD (Average Percentage of Fault Detection) values for test prioritization, and higher generalization ability for model enhancement. We conduct extensive experiments to evaluate FAST across a diverse set of model structures on multiple benchmark datasets to validate the effectiveness, efficiency, and scalability of FAST compared to the state-of-the-art prioritization techniques.
63.3SEMay 20
Agentic Model CheckingYoucheng Sun, Jiawen Liu, Daniel Kroening et al.
Verifying LLM-generated systems code is hard: bugs are prevalent, formal specifications are missing, and safety contracts are encoded implicitly at call sites rather than enforced at function boundaries. We propose agentic model checking, a paradigm that couples LLM agents with a bounded model checking backend under the principle agents propose, solvers verify: agents handle tasks requiring semantic judgment (spec inference, check selection, counterexample classification, refinement proposal) while BMC discharges every soundness-relevant decision. The paradigm rests on three commitments. Specifications are inferred top-down from caller context in a restricted DSL that translates deterministically into the backend's assume/assert primitives, with optional functional-correctness clauses lifting verification from panic-freeness to behavioural faithfulness. Verification is compositional: each function is checked in isolation against its spec with callees replaced by postcondition-constrained stubs, so per-query cost scales with a single function's state space and refinements propagate automatically to callers. Counterexamples are not bug reports: they pass through a validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) that distinguishes active in-tree crashes from latent public-API failures, while modelling artifacts drive a refinement loop rather than being suppressed. We instantiate the approach in BMC-Agent and evaluate it on LLM-generated kernel and compiler code in C and Rust alongside mature OSS-Fuzz-hardened libraries, confirming real defects, producing bounded clean verifications on heavily-fuzzed surfaces, and establishing functional equivalence on selected algorithmic functions.
SEFeb 11
VulReaD: Knowledge-Graph-guided Software Vulnerability Reasoning and DetectionSamal Mukhtar, Yinghua Yao, Zhu Sun et al.
Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Students are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8-10% and multi-class classification by 30% Macro-F1 and 18% Micro-F1 compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.
IRApr 4, 2024Code
KG4RecEval: Does Knowledge Graph Really Matter for Recommender Systems?Haonan Zhang, Dongxia Wang, Zhu Sun et al.
Recommender systems (RSs) are designed to provide personalized recommendations to users. Recently, knowledge graphs (KGs) have been widely introduced in RSs to improve recommendation accuracy. In this study, however, we demonstrate that RSs do not necessarily perform worse even if the KG is downgraded to the user-item interaction graph only (or removed). We propose an evaluation framework KG4RecEval to systematically evaluate how much a KG contributes to the recommendation accuracy of a KG-based RS, using our defined metric KGER (KG utilization efficiency in recommendation). We consider the scenarios where knowledge in a KG gets completely removed, randomly distorted and decreased, and also where recommendations are for cold-start users. Our extensive experiments on four commonly used datasets and a number of state-of-the-art KG-based RSs reveal that: to remove, randomly distort or decrease knowledge does not necessarily decrease recommendation accuracy, even for cold-start users. These findings inspire us to rethink how to better utilize knowledge from existing KGs, whereby we discuss and provide insights into what characteristics of datasets and KG-based RSs may help improve KG utilization efficiency. The code and supplementary material of this paper are available at: https://github.com/HotBento/KG4RecEval.
CRJul 10, 2024
Was it Slander? Towards Exact Inversion of Generative Language ModelsAdrians Skapars, Edoardo Manino, Youcheng Sun et al.
Training large language models (LLMs) requires a substantial investment of time and money. To get a good return on investment, the developers spend considerable effort ensuring that the model never produces harmful and offensive outputs. However, bad-faith actors may still try to slander the reputation of an LLM by publicly reporting a forged output. In this paper, we show that defending against such slander attacks requires reconstructing the input of the forged output or proving that it does not exist. To do so, we propose and evaluate a search based approach for targeted adversarial attacks for LLMs. Our experiments show that we are rarely able to reconstruct the exact input of an arbitrary output, thus demonstrating that LLMs are still vulnerable to slander attacks.
80.2AIMay 16
Responsible Agentic AI Requires Explicit ProvenanceJinwei Hu, Xinmiao Huang, Qisong He et al.
Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace. The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and interventionable provenance needed to assign it when harm emerges from compositions no single party designed. We position that what is missing is not better benchmark-level evaluation but $\textbf{explicit provenance}$ across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. We advance this agenda along four axes: establishing $\textit{why}$ such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing $\textit{what}$ it must encode through a causal attribution function and responsibility tensor, discussing $\textit{how}$ it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and interveneable online before irreversible harm accumulates, and examining $\textit{who}$ bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional.
CLJan 4Code
Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative MontageJinwei Hu, Xinmiao Huang, Youcheng Sun et al.
As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.
SEAug 3, 2021Code
Tutorials on Testing Neural NetworksNicolas Berthier, Youcheng Sun, Wei Huang et al.
Deep learning achieves remarkable performance on pattern recognition, but can be vulnerable to defects of some important properties such as robustness and security. This tutorial is based on a stream of research conducted since the summer of 2018 at a few UK universities, including the University of Liverpool, University of Oxford, Queen's University Belfast, University of Lancaster, University of Loughborough, and University of Exeter. The research aims to adapt software engineering methods, in particular software testing methods, to work with machine learning models. Software testing techniques have been successful in identifying software bugs, and helping software developers in validating the software they design and implement. It is for this reason that a few software testing techniques -- such as the MC/DC coverage metric -- have been mandated in industrial standards for safety critical systems, including the ISO26262 for automotive systems and the RTCA DO-178B/C for avionics systems. However, these techniques cannot be directly applied to machine learning models, because the latter are drastically different from traditional software, and their design follows a completely different development life-cycle. As the outcome of this thread of research, the team has developed a series of methods that adapt the software testing techniques to work with a few classes of machine learning models. The latter notably include convolutional neural networks, recurrent neural networks, and random forest. The tools developed from this research are now collected, and publicly released, in a GitHub repository: \url{https://github.com/TrustAI/DeepConcolic}, with the BSD 3-Clause licence. This tutorial is to go through the major functionalities of the tools with a few running examples, to exhibit how the developed techniques work, what the results are, and how to interpret them.
NEJun 20, 2019Code
testRNN: Coverage-guided Testing on Recurrent Neural NetworksWei Huang, Youcheng Sun, Xiaowei Huang et al.
Recurrent neural networks (RNNs) have been widely applied to various sequential tasks such as text processing, video recognition, and molecular property prediction. We introduce the first coverage-guided testing tool, coined testRNN, for the verification and validation of a major class of RNNs, long short-term memory networks (LSTMs). The tool implements a generic mutation-based test case generation method, and it empirically evaluates the robustness of a network using three novel LSTM structural test coverage metrics. Moreover, it is able to help the model designer go through the internal data flow processing of the LSTM layer. The tool is available through: https://github.com/TrustAI/testRNN under the BSD 3-Clause licence.
LGMar 10, 2018Code
Testing Deep Neural NetworksYoucheng Sun, Xiaowei Huang, Daniel Kroening et al.
Deep neural networks (DNNs) have a wide range of applications, and software employing them must be thoroughly tested, especially in safety-critical domains. However, traditional software test coverage metrics cannot be applied directly to DNNs. In this paper, inspired by the MC/DC coverage criterion, we propose a family of four novel test criteria that are tailored to structural features of DNNs and their semantics. We validate the criteria by demonstrating that the generated test inputs guided via our proposed coverage criteria are able to capture undesired behaviours in a DNN. Test cases are generated using a symbolic approach and a gradient-based heuristic search. By comparing them with existing methods, we show that our criteria achieve a balance between their ability to find bugs (proxied using adversarial examples) and the computational cost of test case generation. Our experiments are conducted on state-of-the-art DNNs obtained using popular open source datasets, including MNIST, CIFAR-10 and ImageNet.
24.1CRMar 10
Compartmentalization-Aware Automated Program RepairJia Hu, Youcheng Sun, Pierre Olivier
Software compartmentalization breaks down an application into compartments isolated from each other: an attacker taking over a compartment will be confined to it, limiting the damage they can cause to the rest of the application. Despite the security promises of this approach, recent studies have shown that most existing compartmentalized software is plagued by vulnerabilities at cross-compartment interfaces, allowing an attacker taking over a compartment to escape its confinement and negate the security guarantees expected from compartmentalization. In that context, securing cross-compartment interfaces is notoriously difficult and engineering-intensive. In light of recent advances in Automated Program Repair (APR), notably through the use of Large Language Models (LLMs), this paper presents a work in progress investigating the suitability of LLM-based APR at securing cross-compartment interfaces as automatically as possible. We observe that existing APR approaches and general purpose/code-centric LLMs used as is are unfit for this task, and present the design, implementation, and early results of a new APR framework dedicated to compartment interface safety. The framework integrates into a feedback loop 1) a specialized fuzzer uncovering cross-compartment interface vulnerabilities; 2) a patch generation component bridging the lack of compartmentalization awareness of existing LLMs with a series of analysis techniques; and 3) a patch validation component assessing the effectiveness of generated vulnerability fixes. We validate our framework over a sample interface vulnerability, comparing it to a naive use of general-purpose LLMs, and discuss future research avenues.
AINov 13, 2024
Causal Explanations for Image ClassifiersHana Chockler, David A. Kelly, Daniel Kroening et al.
Existing algorithms for explaining the output of image classifiers use different definitions of explanations and a variety of techniques to extract them. However, none of the existing tools use a principled approach based on formal definitions of causes and explanations for the explanation extraction. In this paper we present a novel black-box approach to computing explanations grounded in the theory of actual causality. We prove relevant theoretical results and present an algorithm for computing approximate explanations based on these definitions. We prove termination of our algorithm and discuss its complexity and the amount of approximation compared to the precise definition. We implemented the framework in a tool ReX and we present experimental results and a comparison with state-of-the-art tools. We demonstrate that \rex is the most efficient tool and produces the smallest explanations, in addition to outperforming other black-box tools on standard quality measures.
SEDec 25, 2024
SAFLITE: Fuzzing Autonomous Systems via Large Language ModelsTaohong Zhu, Adrians Skapars, Fardeen Mackenzie et al.
Fuzz testing effectively uncovers software vulnerabilities; however, it faces challenges with Autonomous Systems (AS) due to their vast search spaces and complex state spaces, which reflect the unpredictability and complexity of real-world environments. This paper presents a universal framework aimed at improving the efficiency of fuzz testing for AS. At its core is SaFliTe, a predictive component that evaluates whether a test case meets predefined safety criteria. By leveraging the large language model (LLM) with information about the test objective and the AS state, SaFliTe assesses the relevance of each test case. We evaluated SaFliTe by instantiating it with various LLMs, including GPT-3.5, Mistral-7B, and Llama2-7B, and integrating it into four fuzz testing tools: PGFuzz, DeepHyperion-UAV, CAMBA, and TUMB. These tools are designed specifically for testing autonomous drone control systems, such as ArduPilot, PX4, and PX4-Avoidance. The experimental results demonstrate that, compared to PGFuzz, SaFliTe increased the likelihood of selecting operations that triggered bug occurrences in each fuzzing iteration by an average of 93.1\%. Additionally, after integrating SaFliTe, the ability of DeepHyperion-UAV, CAMBA, and TUMB to generate test cases that caused system violations increased by 234.5\%, 33.3\%, and 17.8\%, respectively. The benchmark for this evaluation was sourced from a UAV Testing Competition.
CRJan 4
Exposing Hidden Interfaces: LLM-Guided Type Inference for Reverse Engineering macOS Private FrameworksArina Kharlamova, Youcheng Sun, Ting Yu
Private macOS frameworks underpin critical services and daemons but remain undocumented and distributed only as stripped binaries, complicating security analysis. We present MOTIF, an agentic framework that integrates tool-augmented analysis with a finetuned large language model specialized for Objective-C type inference. The agent manages runtime metadata extraction, binary inspection, and constraint checking, while the model generates candidate method signatures that are validated and refined into compilable headers. On MOTIF-Bench, a benchmark built from public frameworks with groundtruth headers, MOTIF improves signature recovery from 15% to 86% compared to baseline static analysis tooling, with consistent gains in tool-use correctness and inference stability. Case studies on private frameworks show that reconstructed headers compile, link, and facilitate downstream security research and vulnerability studies. By transforming opaque binaries into analyzable interfaces, MOTIF establishes a scalable foundation for systematic auditing of macOS internals.
SENov 24, 2025
LLM-Driven Kernel Evolution: Automating Driver Updates in LinuxArina Kharlamova, Jiawen Liu, Tianyi Zhang et al.
Linux kernel evolution breaks drivers through API/ABI changes, semantic shifts, and security-hardening updates. We introduce DRIVEBENCH, an executable corpus of kernel$\rightarrow$driver co-evolution cases, and AUTODRIVER, a closed-loop, LLM-driven system for automating driver maintenance. The system integrates prompt engineering, multi-agent collaboration, static analysis, and iterative validation to ensure that generated patches are not only syntactically correct but also functionally and semantically consistent with kernel conventions. The corpus spans v5.10-v6.10 with 235 validated cases drawn from 612 candidates. In evaluation across 55 cases, AUTODRIVER achieves 56.4% compilation success; QEMU-based boot verification indicates that compiled patches preserve driver initialization in most instances. By releasing DRIVEBENCH and tooling, we enable reproducible research and a practical route to continuous, safe co-evolution of drivers with the Linux kernel.
LGJul 2, 2025
GPT, But Backwards: Exactly Inverting Language Model OutputsAdrians Skapars, Edoardo Manino, Youcheng Sun et al.
The task of reconstructing unknown textual inputs to language models is a fundamental auditing primitive that allows us to assess the model's vulnerability to a range of security issues, including stealing hidden system prompts, detecting backdoors, and leaking private data. Existing inversion works assume access to differing levels of information (e.g. requiring input-output examples, the model parameters, intermediate activations or output logits) but oftentimes fail to fully reconstruct the desired input. In this paper, we present the Sparse One-hot Discrete Adam (SODA) algorithm, a search-based inversion method that can accurately reconstruct the input text, given white-box access to the language model and its output. Our experiments demonstrate for the first time that exact language model inversion is possible on both natural language and random inputs. Indeed, SODA achieves respectively 98% and 79% reconstruction rates on inputs with lengths up to 10 tokens. Furthermore, we show that input length and vocabulary size have a far greater impact on the probability of a successful reconstruction than the size of the language model itself, thus allowing us to scale to models from 33M to 3B parameters.
SEJun 9, 2025
Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language ModelsDaniel Koh, Yannic Noller, Corina S. Pasareanu et al.
Large language models (LLMs) have demonstrated strong performance on coding tasks such as generation, completion and repair, but their ability to handle complex symbolic reasoning over code still remains underexplored. We introduce the task of worst-case symbolic constraints analysis, which requires inferring the symbolic constraints that characterise worst-case program executions; these constraints can be solved to obtain inputs that expose performance bottlenecks or denial-of-service vulnerabilities in software systems. We show that even state-of-the-art LLMs (e.g., GPT-5) struggle when applied directly on this task. To address this challenge, we propose WARP, an innovative neurosymbolic approach that computes worst-case constraints on smaller concrete input sizes using existing program analysis tools, and then leverages LLMs to generalise these constraints to larger input sizes. Concretely, WARP comprises: (1) an incremental strategy for LLM-based worst-case reasoning, (2) a solver-aligned neurosymbolic framework that integrates reinforcement learning with SMT (Satisfiability Modulo Theories) solving, and (3) a curated dataset of symbolic constraints. Experimental results show that WARP consistently improves performance on worst-case constraint reasoning. Leveraging the curated constraint dataset, we use reinforcement learning to fine-tune a model, WARP-1.0-3B, which significantly outperforms size-matched and even larger baselines. These results demonstrate that incremental constraint reasoning enhances LLMs' ability to handle symbolic reasoning and highlight the potential for deeper integration between neural learning and formal methods in rigorous program analysis.
LGMay 15, 2025
Defending the Edge: Representative-Attention for Mitigating Backdoor Attacks in Federated LearningChibueze Peace Obioma, Youcheng Sun, Mustafa A. Mustafa
Federated learning (FL) enhances privacy and reduces communication cost for resource-constrained edge clients by supporting distributed model training at the edge. However, the heterogeneous nature of such devices produces diverse, non-independent, and identically distributed (non-IID) data, making the detection of backdoor attacks more challenging. In this paper, we propose a novel federated representative-attention-based defense mechanism, named FeRA, that leverages cross-client attention over internal feature representations to distinguish benign from malicious clients. FeRA computes an anomaly score based on representation reconstruction errors, effectively identifying clients whose internal activations significantly deviate from the group consensus. Our evaluation demonstrates FeRA's robustness across various FL scenarios, including challenging non-IID data distributions typical of edge devices. Experimental results show that it effectively reduces backdoor attack success rates while maintaining high accuracy on the main task. The method is model-agnostic, attack-agnostic, and does not require labeled reference data, making it well suited to heterogeneous and resource-limited edge deployments.
IRMay 12, 2025
GRADA: Graph-based Reranking against Adversarial Documents AttackJingjie Zheng, Aryo Pradipta Gema, Giwon Hong et al.
Retrieval Augmented Generation (RAG) frameworks improve the accuracy of large language models (LLMs) by integrating external knowledge from retrieved documents, thereby overcoming the limitations of models' static intrinsic knowledge. However, these systems are susceptible to adversarial attacks that manipulate the retrieval process by introducing documents that are adversarial yet semantically similar to the query. Notably, while these adversarial documents resemble the query, they exhibit weak similarity to benign documents in the retrieval set. Thus, we propose a simple yet effective Graph-based Reranking against Adversarial Document Attacks (GRADA) framework aiming at preserving retrieval quality while significantly reducing the success of adversaries. Our study evaluates the effectiveness of our approach through experiments conducted on five LLMs: GPT-3.5-Turbo, GPT-4o, Llama3.1-8b, Llama3.1-70b, and Qwen2.5-7b. We use three datasets to assess performance, with results from the Natural Questions dataset demonstrating up to an 80% reduction in attack success rates while maintaining minimal loss in accuracy.
SEMay 24, 2023
A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal VerificationNorbert Tihanyi, Ridhi Jain, Yiannis Charalambous et al.
This paper introduces an innovative approach that combines Large Language Models (LLMs) with Formal Verification strategies for automatic software vulnerability repair. Initially, we employ Bounded Model Checking (BMC) to identify vulnerabilities and extract counterexamples. These counterexamples are supported by mathematical proofs and the stack trace of the vulnerabilities. Using a specially designed prompt, we combine the original source code with the identified vulnerability, including its stack trace and counterexample that specifies the line number and error type. This combined information is then fed into an LLM, which is instructed to attempt to fix the code. The new code is subsequently verified again using BMC to ensure the fix succeeded. We present the ESBMC-AI framework as a proof of concept, leveraging the well-recognized and industry-adopted Efficient SMT-based Context-Bounded Model Checker (ESBMC) and a pre-trained transformer model to detect and fix errors in C programs, particularly in critical software components. We evaluated our approach on 50,000 C programs randomly selected from the FormAI dataset with their respective vulnerability classifications. Our results demonstrate ESBMC-AI's capability to automate the detection and repair of issues such as buffer overflow, arithmetic overflow, and pointer dereference failures with high accuracy. ESBMC-AI is a pioneering initiative, integrating LLMs with BMC techniques, offering potential integration into the continuous integration and deployment (CI/CD) process within the software development lifecycle.
CRJan 31, 2022
AntidoteRT: Run-time Detection and Correction of Poison Attacks on Neural NetworksMuhammad Usman, Youcheng Sun, Divya Gopinath et al.
We study backdoor poisoning attacks against image classification networks, whereby an attacker inserts a trigger into a subset of the training data, in such a way that at test time, this trigger causes the classifier to predict some target class. %There are several techniques proposed in the literature that aim to detect the attack but only a few also propose to defend against it, and they typically involve retraining the network which is not always possible in practice. We propose lightweight automated detection and correction techniques against poisoning attacks, which are based on neuron patterns mined from the network using a small set of clean and poisoned test samples with known labels. The patterns built based on the mis-classified samples are used for run-time detection of new poisoned inputs. For correction, we propose an input correction technique that uses a differential analysis to identify the trigger in the detected poisoned images, which is then reset to a neutral color. Our detection and correction are performed at run-time and input level, which is in contrast to most existing work that is focused on offline model-level defenses. We demonstrate that our technique outperforms existing defenses such as NeuralCleanse and STRIP on popular benchmarks such as MNIST, CIFAR-10, and GTSRB against the popular BadNets attack and the more complex DFST attack.
CRDec 10, 2021
Copy, Right? A Testing Framework for Copyright Protection of Deep Learning ModelsJialuo Chen, Jingyi Wang, Tinglan Peng et al.
Deep learning (DL) models, especially those large-scale and high-performance ones, can be very costly to train, demanding a great amount of data and computational resources. Unauthorized reproduction of DL models can lead to copyright infringement and cause huge economic losses to model owners. Existing copyright protection techniques are mostly based on watermarking, which embeds an owner-specified watermark into the model. While being able to provide exact ownership verification, these techniques are 1) invasive, as they need to tamper with the training process, which may affect the utility or introduce new security risks; 2) prone to adaptive attacks that attempt to remove the watermark; and 3) not robust to the emerging model extraction attacks. Latest fingerprinting work, though being non-invasive, also falls short when facing the diverse and ever-growing attack scenarios. In this paper, we propose a novel testing framework for DL copyright protection: DEEPJUDGE. DEEPJUDGE quantitatively tests the similarities between two DL models: a victim model and a suspect model. It leverages a diverse set of testing metrics and test case generation methods to produce a chain of supporting evidence to help determine whether a suspect model is a copy of the victim model. Advantages of DEEPJUDGE include: 1) non-invasive, as it works directly on the model and does not tamper with the training process; 2) efficient, as it only needs a small set of test cases and a quick scan of models; 3) flexible, as it can easily incorporate new metrics or generation methods to obtain more confident judgement; and 4) fairly robust to model extraction and adaptive attacks. We verify the effectiveness of DEEPJUDGE under typical copyright infringement scenarios, including model finetuning, pruning and extraction, via extensive experiments on both image and speech datasets with a variety of model architectures.
LGMar 23, 2021
NNrepair: Constraint-based Repair of Neural Network ClassifiersMuhammad Usman, Divya Gopinath, Youcheng Sun et al.
We present NNrepair, a constraint-based technique for repairing neural network classifiers. The technique aims to fix the logic of the network at an intermediate layer or at the last layer. NNrepair first uses fault localization to find potentially faulty network parameters (such as the weights) and then performs repair using constraint solving to apply small modifications to the parameters to remedy the defects. We present novel strategies to enable precise yet efficient repair such as inferring correctness specifications to act as oracles for intermediate layer repair, and generation of experts for each class. We demonstrate the technique in the context of three different scenarios: (1) Improving the overall accuracy of a model, (2) Fixing security vulnerabilities caused by poisoning of training data and (3) Improving the robustness of the network against adversarial attacks. Our evaluation on MNIST and CIFAR-10 models shows that NNrepair can improve the accuracy by 45.56 percentage points on poisoned data and 10.40 percentage points on adversarial data. NNrepair also provides small improvement in the overall accuracy of models, without requiring new data or re-training.
LGMar 5, 2021
Explanations for Occluded ImagesHana Chockler, Daniel Kroening, Youcheng Sun
Existing algorithms for explaining the output of image classifiers perform poorly on inputs where the object of interest is partially occluded. We present a novel, black-box algorithm for computing explanations that uses a principled approach based on causal theory. We have implemented the method in the DEEPCOVER tool. We obtain explanations that are much more accurate than those generated by the existing explanation tools on images with occlusions and observe a level of performance comparable to the state of the art when explaining images without occlusions.
LGFeb 27, 2021
NEUROSPF: A tool for the Symbolic Analysis of Neural NetworksMuhammad Usman, Yannic Noller, Corina Pasareanu et al.
This paper presents NEUROSPF, a tool for the symbolic analysis of neural networks. Given a trained neural network model, the tool extracts the architecture and model parameters and translates them into a Java representation that is amenable for analysis using the Symbolic PathFinder symbolic execution tool. Notably, NEUROSPF encodes specialized peer classes for parsing the model's parameters, thereby enabling efficient analysis. With NEUROSPF the user has the flexibility to specify either the inputs or the network internal parameters as symbolic, promoting the application of program analysis and testing approaches from software engineering to the field of machine learning. For instance, NEUROSPF can be used for coverage-based testing and test generation, finding adversarial examples and also constraint-based repair of neural networks, thus improving the reliability of neural networks and of the applications that use them. Video URL: https://youtu.be/seal8fG78LI
SEFeb 11, 2021
RobOT: Robustness-Oriented Testing for Deep Learning SystemsJingyi Wang, Jialuo Chen, Youcheng Sun et al.
Recently, there has been a significant growth of interest in applying software engineering techniques for the quality assurance of deep learning (DL) systems. One popular direction is deep learning testing, where adversarial examples (a.k.a.~bugs) of DL systems are found either by fuzzing or guided search with the help of certain testing metrics. However, recent studies have revealed that the commonly used neuron coverage metrics by existing DL testing approaches are not correlated to model robustness. It is also not an effective measurement on the confidence of the model robustness after testing. In this work, we address this gap by proposing a novel testing framework called Robustness-Oriented Testing (RobOT). A key part of RobOT is a quantitative measurement on 1) the value of each test case in improving model robustness (often via retraining), and 2) the convergence quality of the model robustness improvement. RobOT utilizes the proposed metric to automatically generate test cases valuable for improving model robustness. The proposed metric is also a strong indicator on how well robustness improvement has converged through testing. Experiments on multiple benchmark datasets confirm the effectiveness and efficiency of RobOT in improving DL model robustness, with 67.02% increase on the adversarial robustness that is 50.65% higher than the state-of-the-art work DeepGini.
LGJan 25, 2021
Towards Practical Robustness Analysis for DNNs based on PAC-Model LearningRenjue Li, Pengfei Yang, Cheng-Chao Huang et al.
To analyse local robustness properties of deep neural networks (DNNs), we present a practical framework from a model learning perspective. Based on black-box model learning with scenario optimisation, we abstract the local behaviour of a DNN via an affine model with the probably approximately correct (PAC) guarantee. From the learned model, we can infer the corresponding PAC-model robustness property. The innovation of our work is the integration of model learning into PAC robustness analysis: that is, we construct a PAC guarantee on the model level instead of sample distribution, which induces a more faithful and accurate robustness evaluation. This is in contrast to existing statistical methods without model learning. We implement our method in a prototypical tool named DeepPAC. As a black-box method, DeepPAC is scalable and efficient, especially when DNNs have complex structures or high-dimensional inputs. We extensively evaluate DeepPAC, with 4 baselines (using formal verification, statistical methods, testing and adversarial attack) and 20 DNN models across 3 datasets, including MNIST, CIFAR-10, and ImageNet. It is shown that DeepPAC outperforms the state-of-the-art statistical method PROVERO, and it achieves more practical robustness analysis than the formal verification tool ERAN. Also, its results are consistent with existing DNN testing work like DeepGini.
ROOct 16, 2020
Formal Verification of Robustness and Resilience of Learning-Enabled State Estimation SystemsWei Huang, Yifan Zhou, Gaojie Jin et al.
This paper presents a formal verification guided approach for a principled design and implementation of robust and resilient learning-enabled systems. We focus on learning-enabled state estimation systems (LE-SESs), which have been widely used in robotics applications to determine the current state (e.g., location, speed, direction, etc.) of a complex system. The LE-SESs are networked systems, composed of a set of connected components including: Bayes filters for state estimation, and neural networks for processing sensory input. We study LE-SESs from the perspective of formal verification, which determines the satisfiabilty of a system model against the specified properties. Over LE-SESs, we investigate two key properties -- robustness and resilience -- and provide their formal definitions. To enable formal verification, we reduce the LE-SESs to a novel class of labelled transition systems, named {PO}^2-LTS in the paper, and formally express the properties as constrained optimisation objectives. We prove that the verification problems are NP-complete. Based on {PO}^2-LTS and the optimisation objectives, practical verification algorithms are developed to check the satisfiability of the properties on the LE-SESs. As a major case study, we interrogate a real-world dynamic tracking system which uses a single Kalman Filter (KF) -- a special case of Bayes filter -- to localise and track a ground vehicle. Its perception system, based on convolutional neural networks, processes a high-resolution Wide Area Motion Imagery (WAMI) data stream. Experimental results show that our algorithms can not only verify the properties of the WAMI tracking system but also provide representative examples, the latter of which inspired us to take an enhanced LE-SESs design where runtime monitors or joint-KFs are required. Experimental results confirm the improvement in the robustness of the enhanced design.
LGAug 31, 2020
Ranking Policy DecisionsHadrien Pouget, Hana Chockler, Youcheng Sun et al.
Policies trained via Reinforcement Learning (RL) are often needlessly complex, making them difficult to analyse and interpret. In a run with $n$ time steps, a policy will make $n$ decisions on actions to take; we conjecture that only a small subset of these decisions delivers value over selecting a simple default action. Given a trained policy, we propose a novel black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of decisions made in those states. We argue that among other things, the ranked list of states can help explain and understand the policy. As the ranking method is statistical, a direct evaluation of its quality is hard. As a proxy for quality, we use the ranking to create new, simpler policies from the original ones by pruning decisions identified as unimportant (that is, replacing them by default actions) and measuring the impact on performance. Our experiments on a diverse set of standard benchmarks demonstrate that pruned policies can perform on a level comparable to the original policies. Conversely, we show that naive approaches for ranking policy decisions, e.g., ranking based on the frequency of visiting a state, do not result in high-performing pruned policies.
CVFeb 6, 2020
Reliability Validation of Learning Enabled Vehicle TrackingYoucheng Sun, Yifan Zhou, Simon Maskell et al.
This paper studies the reliability of a real-world learning-enabled system, which conducts dynamic vehicle tracking based on a high-resolution wide-area motion imagery input. The system consists of multiple neural network components -- to process the imagery inputs -- and multiple symbolic (Kalman filter) components -- to analyse the processed information for vehicle tracking. It is known that neural networks suffer from adversarial examples, which make them lack robustness. However, it is unclear if and how the adversarial examples over learning components can affect the overall system-level reliability. By integrating a coverage-guided neural network testing tool, DeepConcolic, with the vehicle tracking system, we found that (1) the overall system can be resilient to some adversarial examples thanks to the existence of other components, and (2) the overall system presents an extra level of uncertainty which cannot be determined by analysing the deep learning components only. This research suggests the need for novel verification and validation methods for learning-enabled systems.
LGNov 5, 2019
Coverage Guided Testing for Recurrent Neural NetworksWei Huang, Youcheng Sun, Xingyu Zhao et al.
Recurrent neural networks (RNNs) have been applied to a broad range of applications, including natural language processing, drug discovery, and video recognition. Their vulnerability to input perturbation is also known. Aligning with a view from software defect detection, this paper aims to develop a coverage guided testing approach to systematically exploit the internal behaviour of RNNs, with the expectation that such testing can detect defects with high possibility. Technically, the long short term memory network (LSTM), a major class of RNNs, is thoroughly studied. A family of three test metrics are designed to quantify not only the values but also the temporal relations (including both step-wise and bounded-length) exhibited when LSTM processing inputs. A genetic algorithm is applied to efficiently generate test cases. The test metrics and test case generation algorithm are implemented into a tool TestRNN, which is then evaluated on a set of LSTM benchmarks. Experiments confirm that TestRNN has advantages over the state-of-art tool DeepStellar and attack-based defect detection methods, owing to its working with finer temporal semantics and the consideration of the naturalness of input perturbation. Furthermore, TestRNN enables meaningful information to be collected and exhibited for users to understand the testing results, which is an important step towards interpretable neural network testing.
LGAug 6, 2019
Explaining Image Classifiers using Statistical Fault LocalizationYoucheng Sun, Hana Chockler, Xiaowei Huang et al.
The black-box nature of deep neural networks (DNNs) makes it impossible to understand why a particular output is produced, creating demand for "Explainable AI". In this paper, we show that statistical fault localization (SFL) techniques from software engineering deliver high quality explanations of the outputs of DNNs, where we define an explanation as a minimal subset of features sufficient for making the same decision as for the original input. We present an algorithm and a tool called DeepCover, which synthesizes a ranking of the features of the inputs using SFL and constructs explanations for the decisions of the DNN based on this ranking. We compare explanations produced by DeepCover with those of the state-of-the-art tools GradCAM, LIME, SHAP, RISE and Extremal and show that explanations generated by DeepCover are consistently better across a broad set of experiments. On a benchmark set with known ground truth, DeepCover achieves 76.7% accuracy, which is 6% better than the second best Extremal.
LGDec 18, 2018
A Survey of Safety and Trustworthiness of Deep Neural Networks: Verification, Testing, Adversarial Attack and Defence, and InterpretabilityXiaowei Huang, Daniel Kroening, Wenjie Ruan et al.
In the past few years, significant progress has been made on deep neural networks (DNNs) in achieving human-level performance on several long-standing tasks. With the broader deployment of DNNs on various applications, the concerns over their safety and trustworthiness have been raised in public, especially after the widely reported fatal incidents involving self-driving cars. Research to address these concerns is particularly active, with a significant number of papers released in the past few years. This survey paper conducts a review of the current research effort into making DNNs safe and trustworthy, by focusing on four aspects: verification, testing, adversarial attack and defence, and interpretability. In total, we survey 202 papers, most of which were published after 2017.
LGApr 30, 2018
Concolic Testing for Deep Neural NetworksYoucheng Sun, Min Wu, Wenjie Ruan et al.
Concolic testing combines program execution and symbolic analysis to explore the execution paths of a software program. This paper presents the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we formalise coverage criteria for DNNs that have been studied in the literature, and then develop a coherent method for performing concolic testing to increase test coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples.
LGApr 16, 2018
Global Robustness Evaluation of Deep Neural Networks with Provable Guarantees for the $L_0$ NormWenjie Ruan, Min Wu, Youcheng Sun et al.
Deployment of deep neural networks (DNNs) in safety- or security-critical systems requires provable guarantees on their correct behaviour. A common requirement is robustness to adversarial perturbations in a neighbourhood around an input. In this paper we focus on the $L_0$ norm and aim to compute, for a trained DNN and an input, the maximal radius of a safe norm ball around the input within which there are no adversarial examples. Then we define global robustness as an expectation of the maximal safe radius over a test data set. We first show that the problem is NP-hard, and then propose an approximate approach to iteratively compute lower and upper bounds on the network's robustness. The approach is \emph{anytime}, i.e., it returns intermediate bounds and robustness estimates that are gradually, but strictly, improved as the computation proceeds; \emph{tensor-based}, i.e., the computation is conducted over a set of inputs simultaneously, instead of one by one, to enable efficient GPU computation; and has \emph{provable guarantees}, i.e., both the bounds and the robustness estimates can converge to their optimal values. Finally, we demonstrate the utility of the proposed approach in practice to compute tight bounds by applying and adapting the anytime algorithm to a set of challenging problems, including global robustness evaluation, competitive $L_0$ attacks, test case generation for DNNs, and local robustness evaluation on large-scale ImageNet DNNs. We release the code of all case studies via GitHub.
SEJul 5, 2017
Functional Requirements-Based Automated Testing for AvionicsYoucheng Sun, Martin Brain, Daniel Kroening et al.
We propose and demonstrate a method for the reduction of testing effort in safety-critical software development using DO-178 guidance. We achieve this through the application of Bounded Model Checking (BMC) to formal low-level requirements, in order to generate tests automatically that are good enough to replace existing labor-intensive test writing procedures while maintaining independence from implementation artefacts. Given that existing manual processes are often empirical and subjective, we begin by formally defining a metric, which extends recognized best practice from code coverage analysis strategies to generate tests that adequately cover the requirements. We then formulate the automated test generation procedure and apply its prototype in case studies with industrial partners. In review, the method developed here is demonstrated to significantly reduce the human effort for the qualification of software products under DO-178 guidance.