SEApr 20
Weaponizing the Commons: A Taxonomy and Detection Framework of Abuse on GitHubYuli Cheng, Xiaoyu Zhang, Jiongchi Yu et al.
GitHub plays a critical role in modern software supply chains, making its security an important research concern. Existing studies have primarily focused on CI/CD automation, collaboration patterns, and community management, while abuse behaviors on GitHub have received little systematic investigation. In this paper, we systematically review and summarize reported GitHub abuse behaviors and conduct an empirical analysis of publicly available abuse cases, curating a manually labeled dataset of 392 GitHub instances. Based on this investigation, we propose a comprehensive taxonomy that characterizes their diverse symptoms and root causes from a software security perspective. Building on this taxonomy, we develop a unified detection framework capable of identifying all abuse categories across repositories and user accounts. Evaluated on the constructed dataset, the proposed framework achieves high performance across all categories (e.g., F1-score exceeding 89%). Collectively, this work advances the understanding of GitHub abuse behaviors and lays the groundwork for large-scale, systematic analysis of the GitHub platform to strengthen software supply chain security.
AIMar 9
DynaTrust: Defending Multi-Agent Systems Against Sleeper Agents via Dynamic Trust GraphsYu Li, Qiang Hu, Yao Zhang et al.
Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable collaborative reasoning capabilities but introduce new attack surfaces, such as the sleeper agent, which behave benignly during routine operation and gradually accumulate trust, only revealing malicious behaviors when specific conditions or triggers are met. Existing defense works primarily focus on static graph optimization or hierarchical data management, often failing to adapt to evolving adversarial strategies or suffering from high false-positive rates (FPR) due to rigid blocking policies. To address this, we propose DynaTrust, a novel defense method against sleeper agents. DynaTrust models MAS as a dynamic trust graph~(DTG), and treats trust as a continuous, evolving process rather than a static attribute. It dynamically updates the trust of each agent based on its historical behaviors and the confidence of selected expert agents. Instead of simply blocking, DynaTrust autonomously restructures the graph to isolate compromised agents and restore task connectivity to ensure the usability of MAS. To assess the effectiveness of DynaTrust, we evaluate it on mixed benchmarks derived from AdvBench and HumanEval. The results demonstrate that DynaTrust outperforms the state-of-the-art method AgentShield by increasing the defense success rate by 41.7%, achieving rates exceeding 86% under adversarial conditions. Furthermore, it effectively balances security with utility by significantly reducing FPR, ensuring uninterrupted system operations through graph adaptation.
SEMar 12
Human in the Loop for Fuzz Testing: Literature Review and the Road AheadJiongchi Yu, Xiaolin Wen, Sizhe Cheng et al.
Fuzz testing is one of the most effective techniques for detecting bugs and vulnerabilities in software. However, as the basis of fuzz testing, automated heuristics often fail to uncover deep or complex vulnerabilities. As a result, the performance of fuzz testing remains limited. One promising way to address this limitation is to integrate human expert guidance into the paradigm of fuzz testing. Even though some works have been proposed in this direction, there is still a lack of a systematic research roadmap for combining Human-in-the-Loop (HITL) and fuzz testing, hindering the potential for further enhancing fuzzing effectiveness. To bridge this gap, this paper outlines a forward-looking research roadmap for HITL for fuzz testing. Specifically, we highlight the promise of visualization techniques for interpretable fuzzing processes, as well as on-the-fly interventions that enable experts to guide fuzzing toward hard-to-reach program behaviors. Moreover, the rise of Large Language Models (LLMs) introduces new opportunities and challenges, raising questions about how humans can efficiently provide actionable knowledge, how expert meta-knowledge can be leveraged, and what roles humans should play in the intelligent fuzzing loop with LLMs. To address these questions, we survey existing work on HITL fuzz testing and propose a research agenda emphasizing future opportunities in (1) human monitoring, (2) human steering, and (3) human-LLM collaboration. We call for a paradigm shift toward interactive, human-guided fuzzing systems that integrate expert insight with AI-powered automation in the next-generation fuzzing ecosystem.
SEOct 6, 2025Code
AutoEmpirical: LLM-Based Automated Research for Empirical Software Fault AnalysisJiongchi Yu, Weipeng Jiang, Xiaoyu Zhang et al.
Understanding software faults is essential for empirical research in software development and maintenance. However, traditional fault analysis, while valuable, typically involves multiple expert-driven steps such as collecting potential faults, filtering, and manual investigation. These processes are both labor-intensive and time-consuming, creating bottlenecks that hinder large-scale fault studies in complex yet critical software systems and slow the pace of iterative empirical research. In this paper, we decompose the process of empirical software fault study into three key phases: (1) research objective definition, (2) data preparation, and (3) fault analysis, and we conduct an initial exploration study of applying Large Language Models (LLMs) for fault analysis of open-source software. Specifically, we perform the evaluation on 3,829 software faults drawn from a high-quality empirical study. Our results show that LLMs can substantially improve efficiency in fault analysis, with an average processing time of about two hours, compared to the weeks of manual effort typically required. We conclude by outlining a detailed research plan that highlights both the potential of LLMs for advancing empirical fault studies and the open challenges that required be addressed to achieve fully automated, end-to-end software fault analysis.
CRMay 5
ARGUS: Defending LLM Agents Against Context-Aware Prompt InjectionShihao Weng, Yang Feng, Jinrui Zhang et al.
The rise of Large Language Model (LLM) agents, augmented with tool use, skills, and external knowledge, has introduced new security risks. Among them, prompt injection attacks, where adversaries embed malicious instructions into the agent workflow, have emerged as the primary threat. However, existing benchmarks and defenses are fundamentally limited as they assume context-insensitive settings in which the agent works under a fully specified user instruction, and the attacks are straightforward and context-independent. As a result, they fail to capture real-world deployments where agent behavior usually depends on dynamic context, not just the user prompt, and adversaries can adapt their attacks to different context. Similarly, existing defenses built on this narrow threat model overlook the nature of real-world agent delegation. In this paper, we present AgentLure, a benchmark that captures context-dependent tasks and context-aware prompt injection attacks. AgentLure spans four agentic domains and eight attack vectors across diverse attack surfaces. Our evaluation shows that existing defenses often struggle in this setting, yielding poor performance against such attacks in agentic systems. To address this limitation, we propose ARGUS, a defense mechanism that enforces provenance-aware decision auditing for LLM agents. ARGUS constructs an influence provenance graph to track how untrusted context propagates into agent decisions and verify whether a decision is justified by trustworthy evidence before execution. Our evaluation shows ARGUS reduces attack success rate to 3.8% while preserving 87.5% task utility, significantly outperforming existing defenses and remaining robust against adaptive white-box adversaries.
SEJun 14, 2025
The Foundation Cracks: A Comprehensive Study on Bugs and Testing Practices in LLM LibrariesWeipeng Jiang, Xiaoyu Zhang, Xiaofei Xie et al.
Large Language Model (LLM) libraries have emerged as the foundational infrastructure powering today's AI revolution, serving as the backbone for LLM deployment, inference optimization, fine-tuning, and production serving across diverse applications. Despite their critical role in the LLM ecosystem, these libraries face frequent quality issues and bugs that threaten the reliability of AI systems built upon them. To address this knowledge gap, we present the first comprehensive empirical investigation into bug characteristics and testing practices in modern LLM libraries. We examine 313 bug-fixing commits extracted across two widely-adopted LLM libraries: HuggingFace Transformers and vLLM.Through rigorous manual analysis, we establish comprehensive taxonomies categorizing bug symptoms into 5 types and root causes into 14 distinct categories.Our primary discovery shows that API misuse has emerged as the predominant root cause (32.17%-48.19%), representing a notable transition from algorithm-focused defects in conventional deep learning frameworks toward interface-oriented problems. Additionally, we examine 7,748 test functions to identify 7 distinct test oracle categories employed in current testing approaches, with predefined expected outputs (such as specific tensors and text strings) being the most common strategy. Our assessment of existing testing effectiveness demonstrates that the majority of bugs escape detection due to inadequate test cases (41.73%), lack of test drivers (32.37%), and weak test oracles (25.90%). Drawing from these findings, we offer some recommendations for enhancing LLM library quality assurance.
SEAug 25, 2025
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated CodeKeke Lian, Bin Wang, Lei Zhang et al.
The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks often lack relevance to real-world AI-assisted programming scenarios, making them inadequate for assessing the practical security risks associated with AI-generated code in production environments. To address this gap, we introduce A.S.E (AI Code Generation Security Evaluation), a repository-level evaluation benchmark designed to closely mirror real-world AI programming tasks, offering a comprehensive and reliable framework for assessing the security of AI-generated code. Our evaluation of leading LLMs on A.S.E reveals several key findings. In particular, current LLMs still struggle with secure coding. The complexity in repository-level scenarios presents challenges for LLMs that typically perform well on snippet-level tasks. Moreover, a larger reasoning budget does not necessarily lead to better code generation. These observations offer valuable insights into the current state of AI code generation and help developers identify the most suitable models for practical tasks. They also lay the groundwork for refining LLMs to generate secure and efficient code in real-world applications.
CRAug 11, 2025
Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat SimulationJiongchi Yu, Xiaofei Xie, Qiang Hu et al.
Insider threats, which can lead to severe losses, remain a major security concern. While machine learning-based insider threat detection (ITD) methods have shown promising results, their progress is hindered by the scarcity of high-quality data. Enterprise data is sensitive and rarely accessible, while publicly available datasets, when limited in scale due to cost, lack sufficient real-world coverage; and when purely synthetic, they fail to capture rich semantics and realistic user behavior. To address this, we propose Chimera, the first large language model (LLM)-based multi-agent framework that automatically simulates both benign and malicious insider activities and collects diverse logs across diverse enterprise environments. Chimera models each employee with agents that have role-specific behavior and integrates modules for group meetings, pairwise interactions, and autonomous scheduling, capturing realistic organizational dynamics. It incorporates 15 types of insider attacks (e.g., IP theft, system sabotage) and has been deployed to simulate activities in three sensitive domains: technology company, finance corporation, and medical institution, producing a new dataset, ChimeraLog. We assess ChimeraLog via human studies and quantitative analysis, confirming its diversity, realism, and presence of explainable threat patterns. Evaluations of existing ITD methods show an average F1-score of 0.83, which is significantly lower than 0.99 on the CERT dataset, demonstrating ChimeraLog's higher difficulty and utility for advancing ITD research.