Tevfik Bultan

SE
8papers
23citations
Novelty55%
AI Score55

8 Papers

79.6CRApr 7Code
Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery

Md Shafiuzzaman, Achintya Desai, Wenbo Guo et al.

Symbolic execution detects vulnerabilities with precision, but applying it to large codebases requires harnesses that set up symbolic state, model dependencies, and specify assertions. Writing these harnesses has traditionally been a manual process requiring expert knowledge, which significantly limits the scalability of the technique. We present Static Analysis Informed and LLM-Orchestrated Symbolic Execution (SAILOR), which automates symbolic execution harness construction by combining static analysis with LLM-based synthesis. SAILOR operates in three phases: (1) static analysis identifies candidate vulnerable locations and generates vulnerability specifications; (2) an LLM uses vulnerability specifications and orchestrates harness synthesis by iteratively refining drivers, stubs, and assertions against compiler and symbolic execution feedback; symbolic execution then detects vulnerabilities using the generated harness, and (3) concrete replay validates the symbolic execution results against the unmodified project source. This design combines the scalability of static analysis, the code reasoning of LLMs, the path precision of symbolic execution, and the ground truth produced by concrete execution. We evaluate SAILOR on 10 open-source C/C++ projects totaling 6.8 M lines of code. SAILOR discovers 379 distinct, previously unknown memory-safety vulnerabilities (421 confirmed crashes). The strongest of five baselines we compare SAILOR to (agentic vulnerability detection using Claude Code with full codebase access and unlimited interaction), finds only 12 vulnerabilities. Each phase of SAILOR is critical: Without static analysis targeting confirmed vulnerabilities drop 12.2X; without iterative LLM synthesis zero vulnerabilities are confirmed; and without symbolic execution no approach can detect more than 12 vulnerabilities.

35.6SEMay 7
Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation

Laboni Sarker, Mara Downing, Achintya Desai et al.

Rapid advances in the field of Large Language Models (LLMs) have made LLM-based code generation an important area for investigation. An LLM-based code generator takes a prompt as input and produces code that implements the requirements specified in the prompt. Many software requirements include mathematical formulas that specify the expected behavior of the code to be generated. Given a code generation prompt that contains a mathematical formula, a reasonable expectation is that, if the formula is syntactically modified without changing its semantics, the generated code for the modified prompt should be semantically equivalent. We formalize this concept as syntactic robustness and investigate the syntactic robustness of LLMs as code generators. Our experimental assessment demonstrates that LLMs are not syntactically robust for code generation prompts with formulas, especially for the ones that require mathematical reasoning. We investigate attack strategies that can further deteriorate the syntactic robustness of LLMs. Finally, to mitigate syntactic robustness failures in LLMs, we propose a pre-processing step that uses reductions to transform formulas in prompts to a simplified form. Our experimental results demonstrate that the syntactic robustness of LLM-based code generation improves significantly using our approach, improving syntactic robustness of LLMs from 54.05% to 74.42%.

45.1PLMay 11Code
Quantitative Symbolic Patch Impact Analysis

Laboni Sarker, Abdus Satter, Tevfik Bultan

Traditional equivalence checking classifies programs as equivalent or non-equivalent, providing insufficient information for tasks like patch impact analysis where it is expected the patched version of the program to be non-equivalent to the original program. When two program versions are non-equivalent, determining under what conditions they differ and what percentage of inputs are affected remains an open challenge. In this work, we introduce quantitative partial equivalence analysis, an approach for assessing software patches by quantifying behavioral differences between the original (vulnerable) code and the patched code. Using symbolic analysis, we identify input conditions under which patched and original programs exhibit identical or divergent behaviors. Our approach refines non-equivalence by measuring the extent of behavioral divergence across the input domain. For efficient quantitative analysis of numerical domains, we propose a range-based search heuristic that provides a sound lower bound on equivalence. We demonstrate our approach on 90 CVE patches from widely used open-source projects (Linux, Qemu, FFmpeg), as well as on a Juliet Test Suite-based dataset containing programs with CWEs. Our results show that quantitative partial equivalence analysis effectively characterizes and quantifies patch impact. Additionally, experiments on the EqBench benchmark reveal five C program pairs that are mislabeled as equivalent, and we identify the input conditions under which their behaviors diverge.

75.2SEApr 8
Program Analysis Guided LLM Agent for Proof-of-Concept Generation

Achintya Desai, Md Shafiuzzaman, Wenbo Guo et al.

Software developers frequently receive vulnerability reports that require them to reproduce the vulnerability in a reliable manner by generating a proof-of-concept (PoC) input that triggers it. Given the source code for a software project and a specific code location for a potential vulnerability, automatically generating a PoC for the given vulnerability has been a challenging research problem. Symbolic execution and fuzzing techniques require expert guidance and manual steps and face scalability challenges for PoC generation. Although recent advances in LLMs have increased the level of automation and scalability, the success rate of PoC generation with LLMs remains quite low. In this paper, we present a novel approach called Program Analysis Guided proof of concept generation agENT (PAGENT) that is scalable and significantly improves the success rate of automated PoC generation compared to prior results. PAGENT integrates lightweight and rule-based static analysis phases for providing static analysis guidance and sanitizer-based profiling and coverage information for providing dynamic analysis guidance with a PoC generation agent. Our experiments demonstrate that the resulting hybrid approach significantly outperforms the prior top-performing agentic approach by 132% for the PoC generation task.

SEDec 5, 2025
Fuzzing the brain: Automated stress testing for the safety of ML-driven neurostimulation

Mara Downing, Matthew Peng, Jacob Granley et al.

Objective: Machine learning (ML) models are increasingly used to generate electrical stimulation patterns in neuroprosthetic devices such as visual prostheses. While these models promise precise and personalized control, they also introduce new safety risks when model outputs are delivered directly to neural tissue. We propose a systematic, quantitative approach to detect and characterize unsafe stimulation patterns in ML-driven neurostimulation systems. Approach: We adapt an automated software testing technique known as coverage-guided fuzzing to the domain of neural stimulation. Here, fuzzing performs stress testing by perturbing model inputs and tracking whether resulting stimulation violates biophysical limits on charge density, instantaneous current, or electrode co-activation. The framework treats encoders as black boxes and steers exploration with coverage metrics that quantify how broadly test cases span the space of possible outputs and violation types. Main results: Applied to deep stimulus encoders for the retina and cortex, the method systematically reveals diverse stimulation regimes that exceed established safety limits. Two violation-output coverage metrics identify the highest number and diversity of unsafe outputs, enabling interpretable comparisons across architectures and training strategies. Significance: Violation-focused fuzzing reframes safety assessment as an empirical, reproducible process. By transforming safety from a training heuristic into a measurable property of the deployed model, it establishes a foundation for evidence-based benchmarking, regulatory readiness, and ethical assurance in next-generation neural interfaces.

CRSep 1, 2021
CorbFuzz: Checking Browser Security Policies with Fuzzing

Chaofan Shou, Ismet Burak Kadron, Qi Su et al.

Browsers use security policies to block malicious behaviors. Cross-Origin Read Blocking (CORB) is a browser security policy for preventing side-channel attacks such as Spectre. We propose a web browser security policy fuzzer called CorbFuzz for checking CORB and similar policies. In implementing a security policy, the browser only has access to HTTP requests and responses, and takes policy actions based solely on those interactions. In checking the browser security policies, CorbFuzz uses a policy oracle that tracks the web application behavior and infers the desired policy action based on the web application state. By comparing the policy oracle with the browser behavior, CorbFuzz detects weaknesses in browser security policies. CorbFuzz checks the web browser policy by fuzzing a set of web applications where the state-related queries are symbolically evaluated for increased coverage and automation. CorbFuzz collects type information from database queries and branch conditions in order to prevent the generation of inconsistent data values during fuzzing. We evaluated CorbFuzz on CORB implementations of Chromium and Webkit, and Opaque Response Blocking (ORB) policy implementation of Firefox using web applications collected from GitHub. We found three classes of weaknesses in Chromium's implementation of CORB.

SEJul 26, 2019
Attack Synthesis for Strings using Meta-Heuristics

Seemanta Saha, Ismet Burak Kadron, William Eiers et al.

Information leaks are a significant problem in modern computer systems and string manipulation is prevalent in modern software. We present techniques for automated synthesis of side-channel attacks that recover secret string values based on timing observations on string manipulating code. Our attack synthesis techniques iteratively generate inputs which, when fed to code that accesses the secret, reveal partial information about the secret based on the timing observations, leading to recovery of the secret at the end of the attack sequence. We use symbolic execution to extract path constraints, automata-based model counting to estimate the probability of execution paths, and meta-heuristic methods to maximize information gain based on entropy for synthesizing adaptive attack steps.

SEMay 14, 2019
Incremental Adaptive Attack Synthesis

Seemanta Saha, William Eiers, Ismet Burak Kadron et al.

Information leakage is a significant problem in modern software systems. Information leaks due to side channels are especially hard to detect and analyze. In this paper, we present techniques for automated synthesis of adaptive side-channel attacks that recover secret values. Our attack synthesis techniques iteratively generate inputs which, when fed to code that accesses the secret, reveal partial information about the secret based on the side-channel observations, reducing the remaining uncertainty about the secret in each attack step. Our approach is incremental, reusing results from prior iterations in each attack step to improve the efficiency of attack synthesis. We use symbolic execution to extract path constraints, automata-based model counting to estimate probabilities of execution paths, and meta-heuristics to maximize information gain based on entropy in order to minimize the number of synthesized attack steps.