Can LLMs Enable Verification in Mainstream Programming?
This addresses the problem of low adoption of formal verification in mainstream programming by exploring LLMs as a tool to bridge the gap, though it is incremental as it builds on existing benchmarks and verification methods.
The study investigated whether large language models (LLMs) can generate verified code in formal verification languages like Dafny, Nagini, and Verus, using datasets from the HumanEval benchmark, and found that LLMs can produce verified code with specific information requirements for good results.
Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-ofthe-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.