How Far Have We Gone in Vulnerability Detection Using Large Language Models
This work addresses the challenge of improving software security through better vulnerability detection. Its contribution is incremental: it benchmarks existing LLMs rather than proposing a new detection method.
The authors evaluate large language models (LLMs) on automated software vulnerability detection, comparing them against deep learning-based models and static analyzers on a new benchmark, VulBench, and find that several LLMs outperform traditional deep learning approaches.
As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of large language models (LLMs) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce a comprehensive vulnerability benchmark, VulBench. This benchmark aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. Through our experiments encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security.
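To make the benchmark description concrete, the following minimal sketch shows one way an annotated entry and a detection score could be represented. Every name here (`AnnotatedFunction`, its fields, `score`) is hypothetical and is not VulBench's actual schema or evaluation code. The choice of precision, recall, and F1 is likewise an assumption made for illustration, since the abstract does not state which metrics are reported.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class AnnotatedFunction:
    """One benchmark entry (hypothetical schema, not VulBench's real format)."""
    source: str                 # source code of the function under test
    origin: str                 # assumed labels: "ctf" or "real_world"
    vulnerable: bool            # ground-truth label
    vuln_type: Optional[str]    # e.g. "buffer overflow"; None if not vulnerable
    root_cause: Optional[str]   # free-text annotation; None if not vulnerable


def score(entries: Sequence[AnnotatedFunction],
          predictions: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary vulnerable/safe predictions."""
    if len(entries) != len(predictions):
        raise ValueError("entries and predictions must have the same length")

    pairs = list(zip(entries, predictions))
    tp = sum(1 for e, p in pairs if e.vulnerable and p)      # true positives
    fp = sum(1 for e, p in pairs if not e.vulnerable and p)  # false positives
    fn = sum(1 for e, p in pairs if e.vulnerable and not p)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative toy entries, not taken from the benchmark.
entries = [
    AnnotatedFunction("void f(char *s) { char b[8]; strcpy(b, s); }",
                      "ctf", True, "buffer overflow", "unbounded strcpy"),
    AnnotatedFunction("void g(char *p) { free(p); free(p); }",
                      "ctf", True, "double free", "pointer freed twice"),
    AnnotatedFunction("int h(int a, int b) { return a < b ? a : b; }",
                      "real_world", False, None, None),
    AnnotatedFunction("int k(void) { return 0; }",
                      "real_world", False, None, None),
]
predictions = [True, False, True, False]  # a detector's hypothetical outputs
precision, recall, f1 = score(entries, predictions)
```

Under this sketch, any detector (an LLM prompt, a deep learning model, or a static analyzer) reduces to a function from `source` to a boolean, so all of them can be scored on the same entries.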