CR · Mar 21, 2020

An Empirical Study on Benchmarks of Artificial Software Vulnerabilities

Sijia Geng, Yuekang Li, Yunlan Du, Jun Xu, Yang Liu, Bing Mao

arXiv:2003.09561v17.22 citations

Originality: Synthesis-oriented

AI Analysis

This study addresses concerns about the reliability of artificial benchmarks for evaluating vulnerability detection techniques in software security, though its contribution is incremental because it builds on existing benchmarks.

The study compared three artificial vulnerability benchmarks (LAVA-M, Rode0day, and CGC; 2669 bugs) with 80 real-world memory-corruption CVEs. It found that the benchmarks differ significantly from real-world vulnerabilities despite attempts to mirror them, and it proposed strategies to improve benchmark quality.

Recently, various techniques (e.g., fuzzing) have been developed for vulnerability detection. To evaluate those techniques, the community has been developing benchmarks of artificial vulnerabilities because of a shortage of ground truth. However, there are concerns that such vulnerabilities do not represent reality and may lead to unreliable and misleading results. Unfortunately, research addressing these concerns is lacking. In this work, to understand how closely these benchmarks mirror reality, we perform an empirical study on three artificial vulnerability benchmarks (LAVA-M, Rode0day, and CGC; 2669 bugs) and various real-world memory-corruption vulnerabilities (80 CVEs). Furthermore, we propose a model to depict the properties of memory-corruption vulnerabilities. Following this model, we conduct intensive experiments and data analyses. Our analytic results reveal that while artificial benchmarks attempt to approach the real world, they still differ significantly from reality. Based on the findings, we propose a set of strategies to improve the quality of artificial benchmarks.
