CR, AI, LG
Jan 30, 2024

Finetuning Large Language Models for Vulnerability Detection

arXiv:2401.17010v5, 94 citations, h-index: 9, IEEE Access
Originality: Incremental advance
AI Analysis

This work addresses vulnerability detection in source code for software security, but it is incremental as it adapts an existing state-of-the-art model with optimizations.

The paper tackles vulnerability detection in source code by finetuning the WizardCoder large language model, improving ROC AUC and F1 scores over CodeBERT-like models on both balanced and imbalanced datasets.

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure, and we investigate optimal training regimes. For the imbalanced dataset, which has many more negative examples than positive ones, we also explore different techniques to improve classification performance. The finetuned WizardCoder model improves ROC AUC and F1 on balanced and imbalanced vulnerability datasets over a CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM WizardCoder, increasing its training speed without harming performance, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.
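The abstract reports both ROC AUC and F1 on balanced and imbalanced data. The following is a minimal sketch of how the two metrics are computed and why they can diverge when negatives dominate. It is not the paper's evaluation code: the function names, the 0.5 threshold, and the toy labels and scores are all illustrative assumptions.

```python
from typing import Sequence

# Hypothetical toy data (NOT from the paper): 2 vulnerable functions
# (label 1) among 8 non-vulnerable ones (label 0), with model scores.
TOY_LABELS = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
TOY_SCORES = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs in which
    the positive example gets the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """F1 of the positive class after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print("ROC AUC:", roc_auc(TOY_LABELS, TOY_SCORES))
    print("F1 @ 0.5:", f1_score(TOY_LABELS, TOY_SCORES))
```

On this toy data the ranking is nearly perfect (ROC AUC of 0.9375), yet F1 at the assumed 0.5 threshold is only 0.5, because one false positive and one false negative weigh heavily when there are only two positives. ROC AUC is threshold-free, while F1 depends on the chosen threshold and on the class ratio, which is one reason to report both on imbalanced datasets.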

Code Implementations: 1 repo
