SEJul 8, 2024Code
What's Wrong with Your Code Generated by Large Language Models? An Extensive StudyShihan Dou, Haoxiang Jia, Shenxi Wu et al.
The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect codes that includes three categories and ten sub-categories, and analyzed the root cause for common bug types. To better understand the performance of LLMs in real-world projects, we also manually created a real-world benchmark RWPB. We analyzed bugs on RWPB to highlight distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Our comprehensive and extensive study provides insights into the current limitations of LLM-based code generation and opportunities for enhancing the accuracy and quality of the generated code.
SDMay 18Code
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and OutlookKaiwen Luo, Zhenhong Zhou, Leo Wang et al.
The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.
LGMar 29Code
Robust Smart Contract Vulnerability Detection via Contrastive Learning-Enhanced Granular-ball TrainingZeli Wang, Qingxuan Yang, Shuyin Xia et al.
Deep neural networks (DNNs) have emerged as a prominent approach for detecting smart contract vulnerabilities, driven by the growing contract datasets and advanced deep learning techniques. However, DNNs typically require large-scale labeled datasets to model the relationships between contract features and vulnerability labels. In practice, the labeling process often depends on existing open-sourced tools, whose accuracy cannot be guaranteed. Consequently, label noise poses a significant challenge for the accuracy and robustness of the smart contract, which is rarely explored in the literature. To this end, we propose Contrastive learning-enhanced Granular-Ball smart Contracts training, CGBC, to enhance the robustness of contract vulnerability detection. Specifically, CGBC first introduces a Granular-ball computing layer between the encoder layer and the classifier layer, to group similar contracts into Granular-Balls (GBs) and generate new coarse-grained representations (i.e., the center and the label of GBs) for them, which can correct noisy labels based on the most correct samples. An inter-GB compactness loss and an intra-GB looseness loss are combined to enhance the effectiveness of clustering. Then, to improve the accuracy of GBs, we pretrain the model through unsupervised contrastive learning supported by our novel semantic-consistent smart contract augmentation method. This procedure can discriminate contracts with different labels by dragging the representation of similar contracts closer, assisting CGBC in clustering. Subsequently, we leverage the symmetric cross-entropy loss function to measure the model quality, which can combat the label noise in gradient computations. Finally, extensive experiments show that the proposed CGBC can significantly improve the robustness and effectiveness of the smart contract vulnerability detection when contrasted with baselines.
CRDec 12, 2021Code
Boosting the Capability of Intelligent Vulnerability Detection by Training in a Human-Learning MannerShihan Dou, Yueming Wu, Wenxuan Li et al.
Due to its powerful automatic feature extraction, deep learning (DL) has been widely used in source code vulnerability detection. However, although it performs well on artificial datasets, its performance is not satisfactory when detecting real-world vulnerabilities due to the high complexity of real-world samples. In this paper, we propose to train DL-based vulnerability detection models in a human-learning manner, that is, start with the simplest samples and then gradually transition to difficult knowledge. Specifically, we design a novel framework (Humer) that can enhance the detection ability of DL-based vulnerability detectors. To validate the effectiveness of Humer, we select five state-of-the-art DL-based vulnerability detection models (TokenCNN, VulDeePecker, StatementGRU, ASTGRU, and Devign) to complete our evaluations. Through the results, we find that the use of Humer can increase the F1 of these models by an average of 10.5%. Moreover, Humer can make the model detect up to 16.7% more real-world vulnerabilities. Meanwhile, we also conduct a case study to uncover vulnerabilities from real-world open source products by using these enhanced DL-based vulnerability detectors. Through the results, we finally discover 281 unreported vulnerabilities in NVD, of which 98 have been silently patched by vendors in the latest version of corresponding products, but 159 still exist in the products.
CRJul 8, 2021Code
Contrastive Learning for Robust Android Malware Familial ClassificationYueming Wu, Shihan Dou, Deqing Zou et al.
Due to its open-source nature, Android operating system has been the main target of attackers to exploit. Malware creators always perform different code obfuscations on their apps to hide malicious activities. Features extracted from these obfuscated samples through program analysis contain many useless and disguised features, which leads to many false negatives. To address the issue, in this paper, we demonstrate that obfuscation-resilient malware family analysis can be achieved through contrastive learning. The key insight behind our analysis is that contrastive learning can be used to reduce the difference introduced by obfuscation while amplifying the difference between malware and other types of malware. Based on the proposed analysis, we design a system that can achieve robust and interpretable classification of Android malware. To achieve robust classification, we perform contrastive learning on malware samples to learn an encoder that can automatically extract robust features from malware samples. To achieve interpretable classification, we transform the function call graph of a sample into an image by centrality analysis. Then the corresponding heatmaps can be obtained by visualization techniques. These heatmaps can help users understand why the malware is classified as this family. We implement \emph{IFDroid} and perform extensive evaluations on two datasets. Experimental results show that \emph{IFDroid} is superior to state-of-the-art Android malware familial classification systems. Moreover, \emph{IFDroid} is capable of maintaining a 98.4\% F1 on classifying 69,421 obfuscated malware samples.
AIMay 7
Towards Security-Auditable LLM Agents: A Unified Graph RepresentationChaofan Li, Lyuye Zhang, Jintao Zhai et al.
LLM-based agentic systems are rapidly evolving to perform complex autonomous tasks through dynamic tool invocation, stateful memory management, and multi-agent collaboration. However, this semantics-driven execution paradigm creates a severe semantic gap between low-level physical events and high-level execution intent, making post-hoc security auditing fundamentally difficult. Existing representation mechanisms, including static SBOMs and runtime logs, provide only fragmented evidence and fail to capture cognitive-state evolution, capability bindings, persistent memory contamination, and cascading risk propagation across interacting agents. To bridge this gap, we propose Agent-BOM, a unified structural representation for agent security auditing. Agent-BOM models an agentic system as a hierarchical attributed directed graph that separates static capability bases, such as models, tools, and long-term memory, from dynamic runtime semantic states, such as goals, reasoning trajectories, and actions. These layers are connected through semantic edges and security attributes, transforming fragmented execution traces into queryable audit paths. Building on Agent-BOM, we develop a graph-query-based paradigm for path-level risk assessment and instantiate it with the OWASP Agentic Top 10. We further implement an auditing plugin in the OpenClaw environment to construct Agent-BOM from live executions. Evaluation on representative real-world agentic attack scenarios shows that Agent-BOM can reconstruct stealthy attack chains, including cross-session memory poisoning and tool misuse, capability supply-chain hijacking and unexpected code execution, multi-agent ecosystem hijacking, and privilege and trust abuse. These results demonstrate that Agent-BOM provides a unified and auditable foundation for root-cause analysis and security adjudication in complex agentic ecosystems.
SEOct 30, 2024
Multi-Programming Language Sandbox for LLMsShihan Dou, Jiazheng Zhang, Jianxiang Zang et al.
We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of the code, compiling and executing it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox also integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. MPLSandbox can be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of their generated code. It also helps researchers streamline their workflows for various LLM-based code-related tasks, reducing the development cost. To validate the effectiveness of MPLSandbox, we integrate it into training and deployment approaches, and also employ it to optimize workflows for a wide range of real-world code-related tasks. Our goal is to enhance researcher productivity on LLM-based code-related tasks by simplifying and automating workflows through delegation to MPLSandbox.
SENov 25, 2024
An Empirical Study of Vulnerability Detection using Federated LearningPeiheng Zhou, Ming Hu, Xingrun Quan et al.
Although Deep Learning (DL) methods becoming increasingly popular in vulnerability detection, their performance is seriously limited by insufficient training data. This is mainly because few existing software organizations can maintain a complete set of high-quality samples for DL-based vulnerability detection. Due to the concerns about privacy leakage, most of them are reluctant to share data, resulting in the data silo problem. Since enables collaboratively model training without data sharing, Federated Learning (FL) has been investigated as a promising means of addressing the data silo problem in DL-based vulnerability detection. However, since existing FL-based vulnerability detection methods focus on specific applications, it is still far unclear i) how well FL adapts to common vulnerability detection tasks and ii) how to design a high-performance FL solution for a specific vulnerability detection task. To answer these two questions, this paper first proposes VulFL, an effective evaluation framework for FL-based vulnerability detection. Then, based on VulFL, this paper conducts a comprehensive study to reveal the underlying capabilities of FL in dealing with different types of CWEs, especially when facing various data heterogeneity scenarios. Our experimental results show that, compared to independent training, FL can significantly improve the detection performance of common AI models on all investigated CWEs, though the performance of FL-based vulnerability detection is limited by heterogeneous data. To highlight the performance differences between different FL solutions for vulnerability detection, we extensively investigate the impacts of different configuration strategies for each framework component of VulFL. Our study sheds light on the potential of FL in vulnerability detection, which can be used to guide the design of FL-based solutions for vulnerability detection.
CLFeb 16, 2022
Decorrelate Irrelevant, Purify Relevant: Overcome Textual Spurious Correlations from a Feature PerspectiveShihan Dou, Rui Zheng, Ting Wu et al.
Natural language understanding (NLU) models tend to rely on spurious correlations (i.e., dataset bias) to achieve high performance on in-distribution datasets but poor performance on out-of-distribution ones. Most of the existing debiasing methods often identify and weaken these samples with biased features (i.e., superficial surface features that cause such spurious correlations). However, down-weighting these samples obstructs the model in learning from the non-biased parts of these samples. To tackle this challenge, in this paper, we propose to eliminate spurious correlations in a fine-grained manner from a feature space perspective. Specifically, we introduce Random Fourier Features and weighted re-sampling to decorrelate the dependencies between features to mitigate spurious correlations. After obtaining decorrelated features, we further design a mutual-information-based method to purify them, which forces the model to learn features that are more relevant to tasks. Extensive experiments on two well-studied NLU tasks demonstrate that our method is superior to other comparative approaches.
CRJul 10, 2021
HomDroid: Detecting Android Covert Malware by Social-Network Homophily AnalysisYueming Wu, Deqing Zou, Wei Yang et al.
Android has become the most popular mobile operating system. Correspondingly, an increasing number of Android malware has been developed and spread to steal users' private information. There exists one type of malware whose benign behaviors are developed to camouflage malicious behaviors. The malicious component occupies a small part of the entire code of the application (app for short), and the malicious part is strongly coupled with the benign part. In this case, the malware may cause false negatives when malware detectors extract features from the entire apps to conduct classification because the malicious features of these apps may be hidden among benign features. Moreover, some previous work aims to divide the entire app into several parts to discover the malicious part. However, the premise of these methods to commence app partition is that the connections between the normal part and the malicious part are weak. In this paper, we call this type of malware as Android covert malware and generate the first dataset of covert malware. To detect them, we first conduct static analysis to extract the call graphs. Through the deep analysis on graphs, we observe that although the correlations between the normal part and the malicious part in these graphs are high, the degree of these correlations has a distribution. Based on the observation, we design HomDroid to detect covert malware by analyzing the homophily of call graphs. We identify the ideal threshold of correlation to distinguish the normal part and the malicious part based on the evaluation results on a dataset of 4,840 benign apps and 3,385 covert malicious apps. According to our evaluation results, HomDroid is capable of detecting 96.8% of covert malware while the False Negative Rates of another four state-of-the-art systems (i.e., PerDroid, Drebin, MaMaDroid, and IntDroid) are 30.7%, 16.3%, 15.2%, and 10.4%, respectively.