SE, CL, LG | Jan 14, 2025

CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, Baishakhi Ray

arXiv:2501.08200v1 | 29.170 citations | h-index: 7 | Has Code | 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code)

Originality: Incremental advance

AI Analysis

This work addresses the need for robust benchmarks that assess security risks in LLM-generated code, which matters most for developers with limited security knowledge. It is incremental, however, because it builds on prior benchmarks such as CyberSecEval and SecurityEval.

The paper tackles the problem of evaluating both functional correctness and security in LLM-generated code, introducing CWEval, an outcome-driven framework that reveals a notable portion of functional but insecure code and shows serious inaccuracies in previous evaluations.

Large Language Models (LLMs) have significantly aided developers by generating code or assisting in writing it, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. This makes using LLM-generated code a considerable security risk and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks such as CyberSecEval and SecurityEval attempt to address this need but are hindered by unclear and impractical specifications, and so fail to assess either functionality or security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. The framework assesses code functionality and security simultaneously, using high-quality task specifications and outcome-driven test oracles that provide high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation of LLM-generated code, overcoming the shortcomings of previous benchmarks. Through our evaluations, CWEval reveals that a notable portion of the code produced by LLMs is functional but insecure, and shows that previous evaluations are seriously inaccurate, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at https://github.com/Co1lin/CWEval.
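To make the idea of an outcome-driven oracle concrete, the following is a minimal sketch and not CWEval's actual API or one of its benchmark tasks. It assumes a hypothetical file-reading task exposed to path traversal, with two hypothetical candidate implementations (`read_insecure`, `read_secure`) standing in for LLM outputs. The hypothetical `evaluate` function runs each candidate and judges it only by what it does: one check for functionality, one for security.

```python
import os
import tempfile
from typing import Callable, Tuple

# Hypothetical task: read a file named `name` located under `base_dir`.
ReadFn = Callable[[str, str], str]


def read_insecure(base_dir: str, name: str) -> str:
    # Joins the path without checking that it stays inside base_dir.
    with open(os.path.join(base_dir, name)) as f:
        return f.read()


def read_secure(base_dir: str, name: str) -> str:
    # Resolves the path and rejects anything outside base_dir.
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, name))
    if os.path.commonpath([base, target]) != base:
        raise PermissionError("path escapes base directory")
    with open(target) as f:
        return f.read()


def evaluate(candidate: ReadFn) -> Tuple[bool, bool]:
    """Return (functional, secure), judged only by observed behavior."""
    with tempfile.TemporaryDirectory() as root:
        base = os.path.join(root, "public")
        os.mkdir(base)
        with open(os.path.join(base, "hello.txt"), "w") as f:
            f.write("hello")
        with open(os.path.join(root, "secret.txt"), "w") as f:
            f.write("TOP-SECRET")

        # Functional oracle: a legitimate request must return the content.
        try:
            functional = candidate(base, "hello.txt") == "hello"
        except Exception:
            functional = False

        # Security oracle: a traversal request must not leak the secret.
        try:
            leaked = candidate(base, os.path.join("..", "secret.txt"))
            secure = "TOP-SECRET" not in leaked
        except Exception:
            secure = True  # refusing the request counts as secure here

        return functional, secure
```

Under these assumptions, `evaluate(read_insecure)` yields a functional-but-insecure verdict, while `evaluate(read_secure)` passes both checks. Running the code rather than pattern-matching its text is what lets the two properties be judged together on the same artifact.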
