A ground-truth dataset of real security patches
This dataset addresses a bottleneck for researchers and developers in software security: the shortage of real, diverse data for tasks such as vulnerability detection and NLP-based analysis. The contribution is incremental, since it builds on existing data sources.
The authors address the lack of large, diverse datasets for training machine learning models to identify vulnerabilities by building a ground-truth dataset of 8057 security-relevant commits (equivalent to 5942 patches) from 1339 projects, covering 146 vulnerability types and 20 programming languages. A separate set of 110k non-security commits is also provided.
Training machine learning approaches for vulnerability identification and producing reliable tools to assist developers in implementing quality software -- free of vulnerabilities -- is challenging due to the lack of large datasets and real data. Researchers have been looking at these issues and building datasets. However, these datasets usually lack natural language artifacts and programming language diversity. We scraped the entire CVE Details database for GitHub references and augmented the data with 3 security-related datasets. We used the data to create a ground-truth dataset of natural language artifacts (such as commit messages, commit comments, and summaries), metadata, and code changes. Our dataset integrates a total of 8057 security-relevant commits -- equivalent to 5942 security patches -- from 1339 different projects spanning 146 different types of vulnerabilities and 20 languages. A dataset of 110k non-security-related commits is also provided. Data and scripts are all available on GitHub. Data is stored in a CSV file. Codebases can be downloaded using our scripts. Our dataset is a valuable asset for answering research questions on topics such as the identification of security-relevant information using NLP models; software engineering and security best practices; vulnerability detection and patching; and security program analysis.
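
To illustrate the step of collecting GitHub references from CVE entries, the sketch below shows one way commit links could be picked out of a list of reference URLs. It is a minimal sketch under stated assumptions, not the authors' scripts: the names `CommitRef`, `parse_commit_url`, and `extract_commit_refs` and the URL pattern are hypothetical, chosen for this example only, and the sample URLs are placeholders.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed pattern: https://github.com/<owner>/<repo>/commit/<sha>,
# where <sha> is a 7 to 40 character hexadecimal string.
_COMMIT_URL = re.compile(
    r"https?://github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+)"
    r"/commit/(?P<sha>[0-9a-fA-F]{7,40})"
)


class CommitRef(NamedTuple):
    """Hypothetical record identifying one commit on GitHub."""
    owner: str
    repo: str
    sha: str


def parse_commit_url(url: str) -> Optional[CommitRef]:
    """Return the commit a URL points to, or None if it is not a commit link."""
    match = _COMMIT_URL.search(url)
    if match is None:
        return None
    # Normalize the hash to lowercase so duplicates compare equal.
    return CommitRef(match["owner"], match["repo"], match["sha"].lower())


def extract_commit_refs(references: Iterable[str]) -> List[CommitRef]:
    """Keep only GitHub commit links, dropping duplicates, preserving order."""
    seen = set()
    refs = []
    for url in references:
        ref = parse_commit_url(url)
        if ref is not None and ref not in seen:
            seen.add(ref)
            refs.append(ref)
    return refs


if __name__ == "__main__":
    # Placeholder reference URLs, not taken from the dataset.
    sample = [
        "https://github.com/example-org/example-repo/commit/"
        "0123456789abcdef0123456789abcdef01234567",
        "https://example.com/advisory/1",
    ]
    for ref in extract_commit_refs(sample):
        print(ref.owner, ref.repo, ref.sha)
```

A single vulnerability entry may reference several commits, which is consistent with the dataset counting more security-relevant commits than patches; grouping the extracted commits by their originating entry would be a natural next step in such a pipeline.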