SEAIPLMar 18, 2025

Can LLMs Enable Verification in Mainstream Programming?

arXiv:2503.14183v15 citationsh-index: 4
Originality Incremental advance
AI Analysis

This addresses the problem of low adoption of formal verification in mainstream programming by exploring LLMs as a tool to bridge the gap, though it is incremental as it builds on existing benchmarks and verification methods.

The study investigated whether large language models (LLMs) can generate verified code in formal verification languages like Dafny, Nagini, and Verus, using datasets from the HumanEval benchmark, and found that LLMs can produce verified code with specific information requirements for good results.

Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-ofthe-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes