Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models
This research gives developers who use Large Language Models, particularly GPT models, for code generation a benchmark for measuring how prompt engineering techniques affect the security of the generated code, and identifies which techniques are most effective.
The authors studied how to mitigate vulnerabilities in code generated by Large Language Models. For GPT-4o and GPT-4o-mini, a security-focused prompt prefix reduced the occurrence of security vulnerabilities by up to 56%. With iterative prompting techniques, all tested models (GPT-3.5-turbo, GPT-4o, and GPT-4o-mini) detected and repaired between 41.9% and 68.7% of vulnerabilities in previously generated code.
Prompt engineering reduces reasoning mistakes in Large Language Models (LLMs). However, its effectiveness in mitigating vulnerabilities in LLM-generated code remains underexplored. To address this gap, we implemented a benchmark to automatically assess the impact of various prompt engineering strategies on code security. Our benchmark leverages two peer-reviewed prompt datasets and employs static scanners to evaluate code security at scale. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. Our results show that for GPT-4o and GPT-4o-mini, a security-focused prompt prefix can reduce the occurrence of security vulnerabilities by up to 56%. Additionally, all tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code when using iterative prompting techniques. Finally, we introduce a "prompt agent" that demonstrates how the most effective techniques can be applied in real-world development workflows.
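The two techniques highlighted above, a security-focused prompt prefix and iterative review of previously generated code, can be illustrated with a minimal sketch. It is not the benchmark's implementation: the prefix wording, the review and repair prompt texts, and the names `SECURITY_PREFIX`, `with_security_prefix`, and `generate_with_review` are all assumptions made for illustration, and `generate` stands in for any function that sends a prompt to a model and returns its text response.

```python
from typing import Callable

# Hypothetical wording: the exact prefix used in the benchmark is not stated here.
SECURITY_PREFIX = (
    "You are a security-aware developer. "
    "Avoid introducing vulnerabilities in the code you write.\n\n"
)


def with_security_prefix(task: str, prefix: str = SECURITY_PREFIX) -> str:
    """Prepend a security-focused instruction to a code-generation task."""
    return f"{prefix}{task}"


def generate_with_review(
    task: str,
    generate: Callable[[str], str],
    rounds: int = 1,
) -> str:
    """Generate code, then let the model critique and repair it `rounds` times.

    `generate` is a caller-supplied function (an assumption of this sketch)
    that maps a prompt string to the model's text output.
    """
    # Step 1: initial generation with the security-focused prefix.
    code = generate(with_security_prefix(task))

    for _ in range(rounds):
        # Step 2: ask the model to look for security problems in its own output.
        critique = generate(
            "Review the following code and list any security problems:\n\n" + code
        )
        # Step 3: ask the model to repair the code based on that critique.
        code = generate(
            "Improve the code below based on the critique. Return only code.\n\n"
            "Critique:\n" + critique + "\n\nCode:\n" + code
        )
    return code
```

Passing the model call in as a function keeps the sketch independent of any particular client library, so the same loop can be run against each model under test. In an evaluation setting like the one described, the returned code would then be passed to a static scanner to count the remaining findings.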