CR · Mar 21, 2020

An Empirical Study on Benchmarks of Artificial Software Vulnerabilities

arXiv:2003.09561v1 · 2 citations
AI Analysis

This paper addresses concerns about the reliability of artificial benchmarks used to evaluate vulnerability detection techniques in software security. The contribution is incremental, since it builds on existing benchmarks.

The study compared three artificial vulnerability benchmarks (LAVA-M, Rode0day and CGC, 2669 bugs in total) with 80 real-world CVEs. It found that the benchmarks differ significantly from real vulnerabilities despite attempts to mirror reality, and it proposed strategies to improve benchmark quality.

Recently, various techniques (e.g., fuzzing) have been developed for vulnerability detection. To evaluate those techniques, the community has been developing benchmarks of artificial vulnerabilities because of a shortage of ground truth. However, there are concerns that such vulnerabilities cannot represent reality and may lead to unreliable and misleading results. Unfortunately, there is a lack of research addressing these concerns. In this work, to understand how closely these benchmarks mirror reality, we perform an empirical study on three artificial vulnerability benchmarks - LAVA-M, Rode0day and CGC (2669 bugs) - and various real-world memory-corruption vulnerabilities (80 CVEs). Furthermore, we propose a model to depict the properties of memory-corruption vulnerabilities. Following this model, we conduct intensive experiments and data analyses. Our analytic results reveal that while artificial benchmarks attempt to approach the real world, they still differ significantly from reality. Based on the findings, we propose a set of strategies to improve the quality of artificial benchmarks.
