LGSEMay 29

Automating Formal Verification with Reinforcement Learning and Recursive Inference

arXiv:2605.3091446.8Has Code
AI Analysis

This work addresses the problem of automating formal verification for large language models, which is crucial for ensuring the correctness of software, especially in safety-critical applications. The improvements are incremental but show promising directions for future research.

This thesis explores two methods to improve large language models' ability to generate formally verified programs and proofs. Using reinforcement learning with verifiable rewards (RLVR) in Dafny, the verified pass rate on a refined benchmark improved from 9.7% to 31.1%. Additionally, a verifier-guided inference scaffold in Lean, which treats proof generation as structured search, increased the pass rate on an initial VeriCoding pilot set from 46.2% to 69.2%.

Automated formal verification remains challenging for large language models because data for proof assistants and verification-aware languages is scarce, and correctness depends on satisfying precise machine-checkable specifications rather than producing plausible code. This thesis studies how verifier environments can improve LLM generation of verified programs and proofs through reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search. First, we train open-source models in Dafny with RLVR using Group Relative Policy Optimization (GRPO) and related variants, assembling generated candidates into complete programs and scoring them with compiler and verifier outcomes. Initial experiments on an APPS-derived Dafny dataset increased verified reward from 2.2% to 58.1%, but revealed specification hacking, where models exploit weak formal specifications instead of implementing the intended solutions. After filtering underspecified and vulnerable tasks, multi-turn RLVR on the refined benchmark improves the verified pass rate from 9.7% to 31.1%. Second, we develop a verifier-guided inference scaffold in Lean that treats proof generation as structured search over decomposed subgoals, verifier feedback, diagnostics, and repair. With a fixed base model, the full scaffold with proof reviser improves pass rate on an initial VeriCoding pilot set from 46.2% under direct repair to 69.2%. On the larger VERINA dataset, whole-task decomposition plus proof reviser solves 7 of 42 previously unsolved tasks. We also introduce Dalek-Bench, a repository-scale Lean benchmark derived from the Rust $\texttt{curve25519-dalek}$ verification project; preliminary results remain weak, indicating that stronger progress evaluation and task-specific tool-use policies are still needed.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes