SE, AI, LG, PL | Jun 12, 2024

DafnyBench: A Benchmark for Formal Software Verification

arXiv:2406.08467v1 | 53 citations
Originality: Synthesis-oriented
AI Analysis

DafnyBench gives researchers working on formal software verification with LLMs a shared benchmark, though the contribution is incremental because it builds on existing verification tools and LLMs.

The authors address the problem of evaluating machine learning systems for formal software verification by creating DafnyBench, a large benchmark of over 750 programs totaling about 53,000 lines of code. The best LLM and prompting scheme achieved a 68% success rate at auto-generating the hints that the Dafny verifier needs to verify each program.

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.
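
The retry-with-feedback evaluation described in the abstract can be pictured with a minimal Python sketch. This is not the authors' harness: `propose` and `verify` are hypothetical callables standing in for an LLM call and a Dafny verifier run, and `max_attempts` is an assumed parameter whose value is not taken from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   propose(program, feedback) -> program text with candidate hints filled in
#   verify(candidate)          -> (verified?, verifier error message)
Propose = Callable[[str, Optional[str]], str]
Verify = Callable[[str], Tuple[bool, str]]


def attempt_with_feedback(program: str, propose: Propose, verify: Verify,
                          max_attempts: int = 3) -> bool:
    """Return True if any attempt within max_attempts verifies."""
    feedback: Optional[str] = None  # the first attempt sees no error message
    for _ in range(max_attempts):
        candidate = propose(program, feedback)
        ok, error = verify(candidate)
        if ok:
            return True
        feedback = error  # hand the verifier's error message to the next attempt
    return False


def success_rate(programs: Sequence[str], propose: Propose, verify: Verify,
                 max_attempts: int = 3) -> float:
    """Fraction of programs that verify within max_attempts each."""
    if not programs:
        return 0.0
    solved = sum(attempt_with_feedback(p, propose, verify, max_attempts)
                 for p in programs)
    return solved / len(programs)
```

Under these assumptions, raising `max_attempts` can only keep the success rate the same or increase it, which is the kind of improvement with retries that the abstract says the paper quantifies.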

Code Implementations: 1 repo
