56.4CRMay 22
An Empirical Evaluation of LLM-Generated Code Security Across Prompting MethodsMohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh et al.
The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at the cost of security. Generated code frequently overlooks critical concerns, leaving it vulnerable to issues such as weak encryption and improper input validation. To investigate this problem, we present a comprehensive empirical evaluation of the security quality of LLM-generated code across five LLMs and four programming languages (Java, C++, C, and Python), examining the impact of multiple prompt engineering methods. We introduce a weaknesses-aware zero-shot chain-of-thought (WA-0CoT) prompting strategy that enriches prompts with security context using CWE mappings to guide model reasoning. Our empirical analysis, supported by chi-square tests, finds no statistically significant reductions in vulnerability frequency or density across prompt methods. However, prompting strategies, including WA-0CoT, systematically influence the compositional distribution of CWE categories, with effects varying by programming language. These findings suggest that while security-aware prompting alters the structure of generated weaknesses, prompt engineering alone is insufficient to reliably reduce overall vulnerability levels. The results highlight the importance of language-aware and model-aware prompt design when evaluating the security properties of LLM-generated code.
80.8CRMay 22
Enhancing Reliability in LLM-Based Secure Code GenerationMohammed F. Kharma, Mohammad Alkhanafseh, Ahmed Sabbah et al.
Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and prompting strategies. Existing prompt engineering improves functional correctness but rarely ensures consistent security outcomes. We introduce the \textit{Mitigation-Aware Chain-of-Thought (MA-CoT)} framework, which embeds task-specific CWE mitigation guidance and language-aware safeguards to reduce recurring vulnerabilities in generated code. We evaluate MA-CoT across three LLMs (gpt-5, claude-4.5, gemini-2.5), three programming languages (C, Java, Python), and four prompting strategies (Vanilla, Zero-shot, CoT, MA-CoT) on a 200-task primary dataset, with external validation on LLMSecEval. Using static analysis with expert validation, MA-CoT reduces total security findings from 92 to 39 (57.6\%) on the primary dataset and from 73 to 4 (94.5\%) on LLMSecEval. High-severity findings (Blocker + Critical) drop from 90 to 39 (56.7\%) and from 45 to 2 (95.6\%), respectively. Across both datasets, MA-CoT is the only strategy that consistently improves security reliability; Zero-shot and CoT are less reliable and may increase vulnerability, especially in C. We further introduce a strict layered attribution of vulnerability drivers (language-core vs. stack layers) and show that residual risk concentrates in hardening-oriented patterns (e.g., OS- and toolchain-dependent), motivating secure-by-construction primitives alongside prompting.
63.5CRMay 22
Concept Drift Adaptation Using Self-Supervised and Reinforcement Learning In Android Malware DetectionAhmed Sabbah, Mohammad Kharma, Mohammad Alkhanafseh et al.
Android malware detectors often degrade after deployment because of concept drift, while full retraining at each maintenance step is costly. We propose a chronological adaptive maintenance framework that models deployment-time maintenance as a sequential decision problem. The framework learns a stable latent representation through self-supervised learning during initialization, freezes the encoder, measures latent drift in the fixed representation space, and performs lightweight downstream adaptation using a trainable adapter and classification head. A proximal policy optimization controller selects low-cost maintenance actions based on the detector state, including current utility, retention on a fixed memory set, latent drift indicators, and update cost. We evaluate the framework under a causal deployment-style protocol on emulator and real Android malware datasets with static and dynamic features. Results show that the RL controller provides a strong cost-aware adaptation strategy, consistently remaining among the top-performing policies while achieving a favorable balance between temporal performance, memory retention, and maintenance cost under non-stationary deployment conditions.
67.2CRApr 20
A Quasi-Experimental Developer Study of Security Training in LLM-Assisted Web Application DevelopmentMohammed Kharma, Ahmed Sabbah, Radi Jarrar et al.
This paper presents a controlled quasi-experimental developer study examining whether a layer-based security training package is associated with improved security quality in LLM-assisted implementation of an identity-centric Java Spring Boot backend. The study uses a mixed design with a within-subject pre-training versus post-training comparison and an exploratory between-subject expertise factor. Twelve developers completed matched runs under a common interface, fixed model configuration, counterbalanced task sets, and a shared starter project. Security outcomes were assessed via independent manual validation of submitted repositories by the first and second authors. The primary participant-level endpoint was a severity-weighted validated-weakness score. The post-training condition showed a significant paired reduction under an exact Wilcoxon signed-rank test ($p = 0.0059$). In aggregate, validated weaknesses decreased from 162 to 111 (31.5\%), the severity-weighted burden decreased from 432 to 267 (38.2\%), and critical findings decreased from 24 to 5 (79.2\%). The largest reductions were in authorization and object access (53.3\%) and in authentication, credential policy, and recovery weaknesses (44.7\%). Session and browser trust-boundary issues showed minimal change, while sensitive-data and cryptographic weaknesses showed only marginal improvement. These results suggest that, under the tested conditions, post-training runs reduce validated security burden in LLM-assisted backend development without modifying the model. They do not support replacing secure defaults, static analysis, expert review, or operational hardening.