CR, AI, LG | May 25, 2022

VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection

arXiv:2205.12424v1 | 210 citations | h-index: 21 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for efficient vulnerability detection in software security, though the advance is incremental because it builds on existing pre-training methods.

The paper tackles the problem of detecting security vulnerabilities in source code by pre-training a RoBERTa model with custom tokenization on real-world C/C++ code, achieving state-of-the-art performance across multiple datasets and benchmarks.

This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of the code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and its limited cost in terms of training data size and number of model parameters.
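To make the two ingredients named in the abstract more concrete (a code-aware tokenisation step and masked-language-model pre-training in the style of RoBERTa), here is a minimal, standard-library-only sketch. It is not the authors' pipeline: the regular expressions, the function names `tokenize_c` and `mask_tokens`, the `<mask>` string, and the 15% default masking rate are all assumptions chosen for illustration, and the masking is simplified to plain replacement.

```python
import random
import re

# Hypothetical token pattern for illustration only; the paper's actual
# tokenisation rules are not given in the abstract.
TOKEN_RE = re.compile(
    r"""
    "(?:\\.|[^"\\])*"                    # string literal
  | '(?:\\.|[^'\\])*'                    # character literal
  | [A-Za-z_]\w*                         # identifier or keyword
  | \d+(?:\.\d+)?                        # integer or simple decimal literal
  | ->|\+\+|--|<<|>>|[<>=!]=|&&|\|\|     # common multi-character operators
  | [^\s\w]                              # any other single punctuation mark
    """,
    re.VERBOSE,
)

# Naive comment remover (assumption: no comment markers inside string literals).
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def tokenize_c(source):
    """Strip comments, then split C/C++ source into coarse lexical tokens."""
    return TOKEN_RE.findall(COMMENT_RE.sub(" ", source))


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with <mask>.

    Returns (masked_tokens, labels), where labels holds the original token
    at each masked position and None elsewhere. A model pre-trained with a
    masked-language-model objective learns to predict those labels.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("<mask>")
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    code = "int main() { char buf[8]; strcpy(buf, argv[1]); /* unsafe */ return 0; }"
    tokens = tokenize_c(code)
    masked, labels = mask_tokens(tokens, mask_rate=0.3, seed=1)
    print(tokens)
    print(masked)
```

In a real setup the masked token sequences would be mapped to vocabulary IDs and fed to a transformer encoder; the pre-trained encoder would then be fine-tuned with a classification head on labelled vulnerable and non-vulnerable functions.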

Code Implementations: 1 repo
