CRAug 4, 2024Code
ARVO: Atlas of Reproducible Vulnerabilities for Open Source SoftwareXiang Mei, Pulkit Singh Singaria, Jordi Del Castillo et al.
High-quality datasets of real-world vulnerabilities are enormously valuable for downstream research in software security, but existing datasets are typically small, require extensive manual effort to update, and are missing crucial features that such research needs. In this paper, we introduce ARVO: an Atlas of Reproducible Vulnerabilities in Open-source software. By sourcing vulnerabilities from C/C++ projects that Google's OSS-Fuzz discovered and implementing a reliable re-compilation system, we successfully reproduce more than 5,000 memory vulnerabilities across over 250 projects, each with a triggering input, the canonical developer-written patch for fixing the vulnerability, and the ability to automatically rebuild the project from source and run it at its vulnerable and patched revisions. Moreover, our dataset can be automatically updated as OSS-Fuzz finds new vulnerabilities, allowing it to grow over time. We provide a thorough characterization of the ARVO dataset, show that it can locate fixes more accurately than Google's own OSV reproduction effort, and demonstrate its value for future research through two case studies: firstly evaluating real-world LLM-based vulnerability repair, and secondly identifying over 300 falsely patched (still-active) zero-day vulnerabilities from projects improperly labeled by OSS-Fuzz.
CRMay 5
Root-Cause-Driven Automated Vulnerability RepairHulin Wang, Zion Leonahenahe Basque, Jie Hu et al.
Recent LLM-based systems have made automated vulnerability repair increasingly practical, but two challenges remain. First, without strong signals about where a bug originates, repair agents drift toward shallow edits that silence the observed failure while leaving the underlying defect unresolved. Second, finding the root cause for bugs is hard: even developers familiar with the codebase frequently produce fixes that address symptoms rather than the root cause, and LLM-based agents, operating with noisier context and less program understanding, are no exception. We present Kumushi, a root-cause-driven patching agent that addresses both challenges by combining diversified dynamic fault localization with evidence-weighted ranking to focus the LLM on the code most relevant to the defect. To rigorously measure whether Kumushi produces genuinely better patches, we also introduce a two-tier patch quality metric that pairs automated oracle validation with structured expert assessment of patches. Evaluated on 178 C/C++ vulnerabilities, Kumushi substantially outperforms prior specialized repair agents under automated evaluation while matching a frontier commercial coding agent. Expert assessment then reveals differences that oracles cannot: Kumushi produces more root-cause fixes and fewer superficial patches, and is preferred in the majority of decisive pairwise comparisons. Together, these results demonstrate that progress in automated vulnerability repair requires not only stronger patching systems, but also richer evaluation methods capable of distinguishing genuine fixes from oracle-passing ones.
SESep 27, 2025Code
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source SoftwareZehua Zhang, Ati Priya Bajaj, Divij Handa et al.
Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM Agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) used selective evaluation on a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more diverse in quality, scale, and characteristics. Furthermore, we propose a strong baseline LLM-based agent, OSS-BUILD-AGENT, an effective system with enhanced build instruction retrieval module that achieves state-of-the-art performance on BUILD-BENCH and is adaptable to heterogeneous OSS characteristics. We also provide detailed analysis regarding different compilation method design choices and their influence to the whole task, offering insights to guide future advances. We believe performance on BUILD-BENCH can faithfully reflect an agent's ability to tackle compilation as a complex software engineering tasks, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.
CRFeb 24, 2022
Automatically Mitigating Vulnerabilities in Binary Programs via Partially Recompilable DecompilationPemma Reiter, Hui Jun Tay, Westley Weimer et al.
Vulnerabilities are challenging to locate and repair, especially when source code is unavailable and binary patching is required. Manual methods are time-consuming, require significant expertise, and do not scale to the rate at which new vulnerabilities are discovered. Automated methods are an attractive alternative, and we propose Partially Recompilable Decompilation (PRD). PRD lifts suspect binary functions to source, available for analysis, revision, or review, and creates a patched binary using source- and binary-level techniques. Although decompilation and recompilation do not typically work on an entire binary, our approach succeeds because it is limited to a few functions, like those identified by our binary fault localization. We evaluate these assumptions and find that, without any grammar or compilation restrictions, 70-89% of individual functions are successfully decompiled and recompiled with sufficient type recovery. In comparison, only 1.7% of the full C-binaries succeed. When decompilation succeeds, PRD produces test-equivalent binaries 92.9% of the time. In addition, we evaluate PRD in two contexts: a fully automated process incorporating source-level Automated Program Repair (APR) methods; human-edited source-level repairs. When evaluated on DARPA Cyber Grand Challenge (CGC) binaries, we find that PRD-enabled APR tools, operating only on binaries, performs as well as, and sometimes better than full-source tools, collectively mitigating 85 of the 148 scenarios, a success rate consistent with these same tools operating with access to the entire source code. PRD achieves similar success rates as the winning CGC entries, sometimes finding higher-quality mitigations than those produced by top CGC teams. For generality, our evaluation includes two independently developed APR tools and C++, Rode0day, and real-world binaries.
CRMar 23, 2021
Scam Pandemic: How Attackers Exploit Public Fear through PhishingMarzieh Bitaab, Haehyun Cho, Adam Oest et al.
As the COVID-19 pandemic started triggering widespread lockdowns across the globe, cybercriminals did not hesitate to take advantage of users' increased usage of the Internet and their reliance on it. In this paper, we carry out a comprehensive measurement study of online social engineering attacks in the early months of the pandemic. By collecting, synthesizing, and analyzing DNS records, TLS certificates, phishing URLs, phishing website source code, phishing emails, web traffic to phishing websites, news articles, and government announcements, we track trends of phishing activity between January and May 2020 and seek to understand the key implications of the underlying trends. We find that phishing attack traffic in March and April 2020 skyrocketed up to 220\% of its pre-COVID-19 rate, far exceeding typical seasonal spikes. Attackers exploited victims' uncertainty and fear related to the pandemic through a variety of highly targeted scams, including emerging scam types against which current defenses are not sufficient as well as traditional phishing which outpaced the ecosystem's collective response.
CRJun 22, 2020
You shall not pass: Mitigating SQL Injection Attacks on Legacy Web ApplicationsRasoul Jahanshahi, Adam Doupé, Manuel Egele
SQL injection (SQLi) attacks pose a significant threat to the security of web applications. Existing approaches do not support object-oriented programming that renders these approaches unable to protect the real-world web apps such as Wordpress, Joomla, or Drupal against SQLi attacks. We propose a novel hybrid static-dynamic analysis for PHP web applications that limits each PHP function for accessing the database. Our tool, SQLBlock, reduces the attack surface of the vulnerable PHP functions in a web application to a set of query descriptors that demonstrate the benign functionality of the PHP function. We implement SQLBlock as a plugin for MySQL and PHP. Our approach does not require any modification to the web app. W evaluate SQLBlock on 11 SQLi vulnerabilities in Wordpress, Joomla, Drupal, Magento, and their plugins. We demonstrate that SQLBlock successfully prevents all 11 SQLi exploits with negligible performance overhead (i.e., a maximum of 3% on a heavily-loaded web server)
CRFeb 23, 2016
Moving Target Defense for Web Applications using Bayesian Stackelberg GamesSailik Sengupta, Satya Gautam Vadlamudi, Subbarao Kambhampati et al.
The present complexity in designing web applications makes software security a difficult goal to achieve. An attacker can explore a deployed service on the web and attack at his/her own leisure. Moving Target Defense (MTD) in web applications is an effective mechanism to nullify this advantage of their reconnaissance but the framework demands a good switching strategy when switching between multiple configurations for its web-stack. To address this issue, we propose modeling of a real-world MTD web application as a repeated Bayesian game. We then formulate an optimization problem that generates an effective switching strategy while considering the cost of switching between different web-stack configurations. To incorporate this model into a developed MTD system, we develop an automated system for generating attack sets of Common Vulnerabilities and Exposures (CVEs) for input attacker types with predefined capabilities. Our framework obtains realistic reward values for the players (defenders and attackers) in this game by using security domain expertise on CVEs obtained from the National Vulnerability Database (NVD). We also address the issue of prioritizing vulnerabilities that when fixed, improves the security of the MTD system. Lastly, we demonstrate the robustness of our proposed model by evaluating its performance when there is uncertainty about input attacker information.