SE AI · Sep 16, 2024

VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

arXiv:2409.10756v112.622 citations · h-index: 3

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for benchmarks in LLM-based software security, but its contribution is incremental: it establishes an evaluation framework rather than advancing model capabilities.

The paper evaluates large language models (LLMs) on software vulnerability detection and patching in C code, using a dataset of 307 real-world vulnerabilities from the Linux kernel. It finds that LLMs often struggle to distinguish vulnerable from patched code and tend to produce oversimplified patches.

Large Language Models (LLMs) have shown promise in tasks like code translation, prompting interest in their potential for automating software vulnerability detection (SVD) and patching (SVP). To further research in this area, establishing a benchmark is essential for evaluating the strengths and limitations of LLMs in these tasks. Despite their capabilities, questions remain regarding whether LLMs can accurately analyze complex vulnerabilities and generate appropriate patches. This paper introduces VulnLLMEval, a framework designed to assess the performance of LLMs in identifying and patching vulnerabilities in C code. Our study includes 307 real-world vulnerabilities extracted from the Linux kernel, creating a well-curated dataset that includes both vulnerable and patched code. This dataset, based on real-world code, provides a diverse and representative testbed for evaluating LLM performance in SVD and SVP tasks, offering a robust foundation for rigorous assessment. Our results reveal that LLMs often struggle with distinguishing between vulnerable and patched code. Furthermore, in SVP tasks, these models tend to oversimplify the code, producing solutions that may not be directly usable without further refinement.
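The abstract describes a dataset of paired vulnerable and patched code and reports that models struggle to tell the two apart. As a rough illustration of how detection on such pairs could be scored, here is a minimal Python sketch. It is not the paper's implementation: `CodePair`, `score_detection`, the `is_vulnerable` callback, and the choice of metrics are all hypothetical names and assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class CodePair:
    """Hypothetical benchmark entry: the same code before and after the fix."""
    vulnerable: str
    patched: str


def score_detection(
    pairs: Iterable[CodePair],
    is_vulnerable: Callable[[str], bool],
) -> Dict[str, float]:
    """Score a detector on paired samples.

    `is_vulnerable` stands in for any detector (for example, an LLM prompt
    wrapped in a function) that returns True when it flags the code.
    """
    tp = fp = tn = fn = 0
    same_verdict = 0  # pairs where both versions received the same answer
    n_pairs = 0

    for pair in pairs:
        n_pairs += 1
        flagged_vuln = is_vulnerable(pair.vulnerable)
        flagged_patch = is_vulnerable(pair.patched)

        # The vulnerable version should be flagged.
        if flagged_vuln:
            tp += 1
        else:
            fn += 1

        # The patched version should not be flagged.
        if flagged_patch:
            fp += 1
        else:
            tn += 1

        if flagged_vuln == flagged_patch:
            same_verdict += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Fraction of pairs the detector could not tell apart.
        "same_verdict_rate": same_verdict / n_pairs if n_pairs else 0.0,
    }
```

The `same_verdict_rate` field is included because aggregate accuracy alone can hide the failure the abstract highlights: a detector that gives the same answer for both versions of a pair has not actually distinguished the vulnerable code from its fix.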
