SE Aug 1, 2023
The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models
Haonan Li, Yu Hao, Yizhuo Zhai et al.
Static analysis is a widely used technique in software engineering for identifying and mitigating bugs. However, a significant hurdle lies in achieving a delicate balance between precision and scalability. Large Language Models (LLMs) offer a promising alternative, as recent advances demonstrate remarkable capabilities in comprehending, generating, and even debugging code. Yet, the logic of bugs can be complex and require sophisticated reasoning and a large analysis scope spanning multiple functions. Therefore, at this point, LLMs are better used in an assistive role to complement static analysis. In this paper, we take a deep dive into the open space of LLM-assisted static analysis, using use-before-initialization (UBI) bugs as a case study. To this end, we develop LLift, a fully automated framework that interfaces with both a static analysis tool and an LLM. By carefully designing the framework and the prompts, we are able to overcome a number of challenges, including bug-specific modeling, the large problem scope, the non-deterministic nature of LLMs, etc. Tested in a real-world scenario analyzing nearly a thousand potential UBI bugs produced by static analysis, LLift demonstrates a potent capability, showcasing a reasonable precision (50%) and appearing to have no missing bugs. It even identified 13 previously unknown UBI bugs in the Linux kernel. This research paves the way for new opportunities and methodologies in using LLMs for bug discovery in extensive, real-world datasets.
LG Oct 30, 2025
LLMBisect: Breaking Barriers in Bug Bisection with A Comparative Analysis Pipeline
Zheng Zhang, Haonan Li, Xingyu Li et al.
Bug bisection has been an important security task that aims to understand the range of software versions impacted by a bug, i.e., identifying the commit that introduced the bug. However, traditional patch-based bisection methods face several significant barriers. For example, they assume that the bug-inducing commit (BIC) and the patch commit modify the same functions, which is not always true. They often rely solely on code changes, while the commit message frequently contains a wealth of vulnerability-related information. They are also based on simple heuristics (e.g., assuming the BIC initializes lines deleted in the patch) and lack any logical analysis of the vulnerability. In this paper, we make the observation that Large Language Models (LLMs) are well-positioned to break the barriers of existing solutions, e.g., by comprehending both textual data and code in patches and commits. Unlike previous BIC identification approaches, which yield poor results, we propose a comprehensive multi-stage pipeline that leverages LLMs to: (1) fully utilize patch information, (2) compare multiple candidate commits in context, and (3) progressively narrow down the candidates through a series of down-selection steps. In our evaluation, we demonstrate that our approach outperforms the state-of-the-art solution in accuracy by more than 38%. Our results further confirm that the comprehensive multi-stage pipeline is essential, as it improves accuracy by 60% over a baseline LLM-based bisection method.
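The progressive down-selection in step (3) can be pictured with a small sketch. This is not the LLMBisect implementation: `rank_batch` is a hypothetical stand-in for an LLM comparison call, the commit IDs and scores are made up, and the batch size is an assumed value chosen only for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical comparator type: given a batch of candidate commit IDs,
# return them ordered from most to least likely to be bug-inducing.
Ranker = Callable[[Sequence[str]], List[str]]


def down_select(candidates: Sequence[str], rank_batch: Ranker,
                batch_size: int = 4) -> str:
    """Compare candidates in small batches, keep the top-ranked one
    from each batch, and repeat until a single candidate remains."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each round shrinks the pool")
    pool = list(candidates)
    while len(pool) > 1:
        survivors: List[str] = []
        for start in range(0, len(pool), batch_size):
            batch = pool[start:start + batch_size]
            survivors.append(rank_batch(batch)[0])
        pool = survivors
    return pool[0]


def toy_ranker(batch: Sequence[str]) -> List[str]:
    # Placeholder for an LLM call: ranks by a made-up suspicion score.
    scores = {"a1f3": 0.2, "b7c9": 0.9, "c0de": 0.4, "d4e5": 0.1, "e6f7": 0.6}
    return sorted(batch, key=lambda commit: scores[commit], reverse=True)


best = down_select(["a1f3", "b7c9", "c0de", "d4e5", "e6f7"], toy_ranker, batch_size=2)
print(best)
```

Comparing candidates side by side in each batch, rather than scoring each commit in isolation, mirrors step (2) of the abstract. The actual stages and prompts are described in the paper, not here.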
CR Sep 26, 2025
What Do They Fix? LLM-Aided Categorization of Security Patches for Critical Memory Bugs
Xingyu Li, Juefei Pu, Yifan Wu et al.
Open-source software projects are foundational to modern software ecosystems, with the Linux kernel standing out as a critical exemplar due to its ubiquity and complexity. Although security patches are continuously integrated into the Linux mainline kernel, downstream maintainers often delay their adoption, creating windows of vulnerability. A key reason for this lag is the difficulty in identifying security-critical patches, particularly those addressing exploitable vulnerabilities such as out-of-bounds (OOB) accesses and use-after-free (UAF) bugs. This challenge is exacerbated by intentionally silent bug fixes, incomplete or missing CVE assignments, delays in CVE issuance, and recent changes to the CVE assignment criteria for the Linux kernel. While fine-grained patch classification approaches exist, they exhibit limitations in both coverage and accuracy. In this work, we identify previously unexplored opportunities to significantly improve fine-grained patch classification. Specifically, by leveraging cues from commit titles/messages and diffs alongside appropriate code context, we develop DUALLM, a dual-method pipeline that integrates two approaches based on a Large Language Model (LLM) and a fine-tuned small language model. DUALLM achieves 87.4% accuracy and an F1-score of 0.875, significantly outperforming prior solutions. Notably, DUALLM successfully identified 111 of 5,140 recent Linux kernel patches as addressing OOB or UAF vulnerabilities, with 90 true positives confirmed by manual verification (many do not have clear indications in patch descriptions). Moreover, we constructed proof-of-concepts for two identified bugs (one UAF and one OOB), including one developed to conduct a previously unknown control-flow hijack as further evidence of the correctness of the classification.
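The deployment numbers above imply a precision figure that the abstract does not state directly. A quick check, using only the counts given (5,140 patches examined, 111 flagged, 90 confirmed); the variable names are illustrative:

```python
total = 5140      # recent Linux kernel patches examined
flagged = 111     # patches labeled as OOB or UAF fixes
confirmed = 90    # flagged patches confirmed by manual verification

precision = confirmed / flagged   # share of flagged patches that were correct
flag_rate = flagged / total       # share of all patches that were flagged

print(f"precision on flagged patches: {precision:.1%}")
print(f"share of patches flagged:     {flag_rate:.2%}")
```

This works out to roughly 81% precision on the flagged set, which is a separate measurement from the 87.4% accuracy quoted earlier in the abstract. Recall on the 5,140 patches cannot be derived from these counts, since the number of missed OOB/UAF fixes is not given.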
SE Apr 16, 2025
The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs
Haonan Li, Hang Zhang, Kexin Pei et al.
Static analysis plays a crucial role in software vulnerability detection, yet faces a persistent precision-scalability tradeoff. In large codebases like the Linux kernel, traditional static analysis tools often generate excessive false positives due to simplified vulnerability modeling and overapproximation of path and data constraints. While large language models (LLMs) demonstrate promising code understanding capabilities, their direct application to program analysis remains unreliable due to inherent reasoning limitations. We introduce BugLens, a post-refinement framework that significantly enhances static analysis precision for bug detection. BugLens guides LLMs through structured reasoning steps to assess security impact and validate constraints from the source code. When evaluated on Linux kernel taint-style bugs detected by static analysis tools, BugLens improves precision approximately 7-fold (from 0.10 to 0.72), substantially reducing false positives while uncovering four previously unreported vulnerabilities. Our results demonstrate that a well-structured, fully automated LLM-based workflow can effectively complement and enhance traditional static analysis techniques.
CR Nov 11, 2021
SyzScope: Revealing High-Risk Security Impacts of Fuzzer-Exposed Bugs in Linux kernel
Xiaochen Zou, Guoren Li, Weiteng Chen et al.
Fuzzing has become one of the most effective bug-finding approaches for software. In recent years, 24/7 continuous fuzzing platforms have emerged to test critical pieces of software, e.g., the Linux kernel. Though capable of discovering many bugs and providing reproducers (e.g., proof-of-concepts), a major problem is that they neglect a critical function that should have been built-in, i.e., evaluation of a bug's security impact. It is well-known that the lack of understanding of security impact can lead to delayed bug fixes as well as delayed patch propagation. In this paper, we develop SyzScope, a system that can automatically uncover new "high-risk" impacts given a bug with seemingly "low-risk" impacts. From analyzing over a thousand low-risk bugs on syzbot, SyzScope successfully determined that 183 low-risk bugs (more than 15%) in fact contain high-risk impacts, e.g., control flow hijack and arbitrary memory write, some of which still do not have patches available.
CR Nov 3, 2020
You Do (Not) Belong Here: Detecting DPI Evasion Attacks with Context Learning
Shitong Zhu, Shasha Li, Zhongjie Wang et al.
As Deep Packet Inspection (DPI) middleboxes become increasingly popular, a spectrum of adversarial attacks have emerged with the goal of evading such middleboxes. Many of these attacks exploit discrepancies between the middlebox network protocol implementations, and the more rigorous/complete versions implemented at end hosts. These evasion attacks largely involve subtle manipulations of packets to cause different behaviours at DPI and end hosts, to cloak malicious network traffic that is otherwise detectable. With recent automated discovery, it has become prohibitively challenging to manually curate rules for detecting these manipulations. In this work, we propose CLAP, the first fully-automated, unsupervised ML solution to accurately detect and localize DPI evasion attacks. By learning what we call the packet context, which essentially captures inter-relationships across both (1) different packets in a connection; and (2) different header fields within each packet, from benign traffic traces only, CLAP can detect and pinpoint packets that violate the benign packet contexts (which are the ones that are specially crafted for evasion purposes). Our evaluations with 73 state-of-the-art DPI evasion attacks show that CLAP achieves an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.963, an Equal Error Rate (EER) of only 0.061 in detection, and an accuracy of 94.6% in localization. These results suggest that CLAP can be a promising tool for thwarting DPI evasion attacks.
CR Aug 8, 2020
PolyScope: Multi-Policy Access Control Analysis to Triage Android Systems
Yu-Tsung Lee, William Enck, Haining Chen et al.
Android filesystem access control provides a foundation for Android system integrity. Android utilizes a combination of mandatory (e.g., SEAndroid) and discretionary (e.g., UNIX permissions) access control, both to protect the Android platform from Android/OEM services and to protect Android/OEM services from third-party apps. However, OEMs often create vulnerabilities when they introduce market-differentiating features because they err when re-configuring this complex combination of Android policies. In this paper, we propose the PolyScope tool to triage the combination of Android filesystem access control policies to vet releases for vulnerabilities. The PolyScope approach leverages two main insights: (1) adversaries may exploit the coarse granularity of mandatory policies and the flexibility of discretionary policies to increase the permissions available to launch attacks, which we call permission expansion, and (2) system configurations may limit the ways adversaries may use their permissions to launch attacks, motivating computation of attack operations. We apply PolyScope to three Google and five OEM Android releases to compute the attack operations accurately to vet these releases for vulnerabilities, finding that permission expansion increases the permissions available to launch attacks, sometimes by more than 10X, but a significant fraction of these permissions (about 15-20%) are not convertible into attack operations. Using PolyScope, we find two previously unknown vulnerabilities, showing how PolyScope helps OEMs triage the complex combination of access control policies down to attack operations worthy of testing.
CR Jan 29, 2020
A4: Evading Learning-based Adblockers
Shitong Zhu, Zhongjie Wang, Xun Chen et al.
Efforts by online ad publishers to circumvent traditional ad blockers towards regaining fiduciary benefits have been demonstrably successful. As a result, a set of adblockers has recently emerged that apply machine learning instead of manually curated rules and have been shown to be more robust in blocking ads on websites, including social media sites such as Facebook. Among these, AdGraph is arguably the state-of-the-art learning-based adblocker. In this paper, we develop A4, a tool that intelligently crafts adversarial samples of ads to evade AdGraph. Unlike the popular research on adversarial samples against images or videos that are considered less- to un-restricted, the samples that A4 generates preserve application semantics of the web page, or are actionable. Through several experiments we show that A4 can bypass AdGraph about 60% of the time, which surpasses the state-of-the-art attack by a significant margin of 84.3%; in addition, changes to the visual layout of the web page due to these perturbations are imperceptible. We envision that the algorithmic framework proposed in A4 is also promising in improving adversarial attacks against other learning-based web applications with similar requirements.
CR Oct 22, 2018
IoTSan: Fortifying the Safety of IoT Systems
Dang Tu Nguyen, Chengyu Song, Zhiyun Qian et al.
Today's IoT systems include event-driven smart applications (apps) that interact with sensors and actuators. A problem specific to IoT systems is that buggy apps, unforeseen bad app interactions, or device/communication failures can cause unsafe and dangerous physical states. Detecting flaws that lead to such states requires a holistic view of installed apps, component devices, their configurations, and more importantly, how they interact. In this paper, we design IoTSan, a novel practical system that uses model checking as a building block to reveal "interaction-level" flaws by identifying events that can lead the system to unsafe states. In building IoTSan, we design novel techniques tailored to IoT systems, to alleviate the state explosion associated with model checking. IoTSan also automatically translates IoT apps into a format amenable to model checking. Finally, to understand the root cause of a detected vulnerability, we design an attribution mechanism to identify problematic and potentially malicious apps. We evaluate IoTSan on the Samsung SmartThings platform. From 76 manually configured systems, IoTSan detects 147 vulnerabilities. We also evaluate IoTSan with malicious SmartThings apps from a previous effort. IoTSan detects the potential safety violations and also effectively attributes these apps as malicious.
CY May 22, 2018
AdGraph: A Graph-Based Approach to Ad and Tracker Blocking
Umar Iqbal, Peter Snyder, Shitong Zhu et al.
User demand for blocking advertising and tracking online is large and growing. Existing tools, both deployed and described in research, have proven useful, but lack either the completeness or robustness needed for a general solution. Existing detection approaches generally focus on only one aspect of advertising or tracking (e.g. URL patterns, code structure), making existing approaches susceptible to evasion. In this work we present AdGraph, a novel graph-based machine learning approach for detecting advertising and tracking resources on the web. AdGraph differs from existing approaches by building a graph representation of the HTML structure, network requests, and JavaScript behavior of a webpage, and using this unique representation to train a classifier for identifying advertising and tracking resources. Because AdGraph considers many aspects of the context a network request takes place in, it is less susceptible to the single-factor evasion techniques that flummox existing approaches. We evaluate AdGraph on the Alexa top-10K websites, and find that it is highly accurate, able to replicate the labels of human-generated filter lists with 95.33% accuracy, and can even identify many mistakes in filter lists. We implement AdGraph as a modification to Chromium. AdGraph adds only minor overhead to page loading and execution, and is actually faster than stock Chromium on 42% of websites and AdBlock Plus on 78% of websites. Overall, we conclude that AdGraph is both accurate enough and performant enough for online use, breaking comparable or fewer websites than popular filter list based approaches.
CR May 19, 2016
A First Look at Ad-block Detection: A New Arms Race on the Web
Muhammad Haris Mughees, Zhiyun Qian, Zubair Shafiq et al.
The rise of ad-blockers is viewed as an economic threat by online publishers, especially those who primarily rely on advertising to support their services. To address this threat, publishers have started retaliating by employing ad-block detectors, which scout for ad-blocker users and react to them by restricting their content access and pushing them to whitelist the website or disable ad-blockers altogether. The clash between ad-blockers and ad-block detectors has resulted in a new arms race on the web. In this paper, we present the first systematic measurement and analysis of ad-block detection on the web. We have designed and implemented a machine learning based technique to automatically detect ad-block detection, and use it to study the deployment of ad-block detectors on Alexa top-100K websites. The approach is promising, with a precision of 94.8% and a recall of 93.1%. We characterize the spectrum of different strategies used by websites for ad-block detection. We find that most publishers use fairly simple passive approaches for ad-block detection. However, we also note that a few websites use third-party services, e.g. PageFair, for ad-block detection and response. The third-party services use active deception and other sophisticated tactics to detect ad-blockers. We also find that the third-party services can successfully circumvent ad-blockers and display ads on publisher websites.
SI Nov 18, 2015
Behavior Query Discovery in System-Generated Temporal Graphs
Bo Zong, Xusheng Xiao, Zhichun Li et al.
Computer system monitoring generates huge amounts of logs that record the interaction of system entities. Querying such data to better understand system behaviors and identify potential system risks and malicious behaviors is a challenging task for system administrators due to the dynamics and heterogeneity of the data. System monitoring data are essentially heterogeneous temporal graphs with nodes being system entities and edges being their interactions over time. Given the complexity of such graphs, it becomes time-consuming for system administrators to manually formulate useful queries in order to examine abnormal activities, attacks, and vulnerabilities in computer systems. In this work, we investigate how to query temporal graphs and treat query formulation as a discriminative temporal graph pattern mining problem. We introduce TGMiner to mine discriminative patterns from system logs, and these patterns can be taken as templates for building more complex queries. TGMiner leverages temporal information in graphs to prune graph patterns that share similar growth trends without compromising pattern quality. Experimental results on real system data show that TGMiner is 6-32 times faster than baseline methods. The discovered patterns were verified by system experts; they achieved high precision (97%) and recall (91%).
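For readers who want a single summary number, the precision and recall reported above can be combined into an F1 score. The abstract does not report F1; the value below is derived from the two stated figures using the standard harmonic-mean definition, and the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures stated in the abstract: precision 97%, recall 91%.
tgminer_f1 = f1_score(0.97, 0.91)
print(round(tgminer_f1, 3))
```

With these inputs the score comes out at about 0.939.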