SE AI CR

Feb 9, 2025

Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models

Marc Bruni, Fabio Gabrielli, Mohammad Ghafari, Martin Kropp

arXiv:2502.06039v1

20.424 citations

h-index: 2

2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge)

Originality: Highly original

AI Analysis

This research targets developers who use Large Language Models, particularly GPT models, to generate code. It provides a benchmark for measuring how prompt engineering techniques affect the security of that code and identifies which techniques are effective.

The authors tackled the problem of mitigating vulnerabilities in code generated by Large Language Models. For GPT-4o and GPT-4o-mini, a security-focused prompt prefix reduced the occurrence of security vulnerabilities by up to 56%. With iterative prompting techniques, all tested models (GPT-3.5-turbo, GPT-4o, and GPT-4o-mini) detected and repaired between 41.9% and 68.7% of vulnerabilities in previously generated code.

Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a "prompt agent" that demonstrates how the most effective techniques can be applied in real-world development workflows.
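The two techniques named in the abstract, a security-focused prompt prefix and iterative detect-and-repair prompting, can be illustrated with a minimal Python sketch. This is not the authors' benchmark or prompt agent: the prefix wording, the prompt texts, the function names, and the fixed number of rounds are all assumptions made for illustration. The model is passed in as a plain `generate` callable (prompt in, text out), so the sketch does not depend on any particular client library.

```python
from typing import Callable

# Hypothetical prefix wording; the paper's exact prompt text is not reproduced here.
SECURITY_PREFIX = (
    "You are a security-aware developer. "
    "Avoid introducing vulnerabilities in the code you write. "
)


def with_security_prefix(task: str, prefix: str = SECURITY_PREFIX) -> str:
    """Prepend a security-focused instruction to a code-generation prompt."""
    return f"{prefix}{task}"


def critique_and_repair(
    code: str,
    generate: Callable[[str], str],
    rounds: int = 2,
) -> str:
    """Iteratively ask the model to review its own code, then to improve it.

    `generate` is any function that sends a prompt to a model and returns
    the reply as text (an assumption; plug in a real client as needed).
    """
    for _ in range(rounds):
        # Step 1: the model looks for vulnerabilities in the current code.
        critique = generate(
            "Review the following code and list any security vulnerabilities.\n\n"
            + code
        )
        # Step 2: the model rewrites the code based on its own review.
        code = generate(
            "Improve the code based on this review. Return only the code.\n\n"
            "Review:\n" + critique + "\n\nCode:\n" + code
        )
    return code
```

In a setup like the one the abstract describes, the output of `critique_and_repair` would then be passed to a static scanner to count the vulnerabilities that remain. The scanner step is left out here because the abstract does not name the tools used.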
