A benchmark for vericoding: formally verified program synthesis
This work addresses the problem of generating reliable, bug-free code for developers and researchers by providing a comprehensive benchmark, though it is incremental as it builds on existing vericoding concepts.
The authors introduced the largest benchmark for vericoding, which involves generating formally verified code from formal specifications, and tested it with off-the-shelf LLMs, achieving success rates of 27% in Lean, 44% in Verus/Rust, and 82% in Dafny.
We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark