PL AI CL LGSep 25, 2025

InvBench: Can LLMs Accelerate Program Verification with Invariant Synthesis?

Anjiang Wei, Tarun Suresh, Tianran Sun, Haoze Wu, Ke Wang, Alex Aiken

Stanford

arXiv:2509.21629v13 citationsh-index: 12

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of accelerating program verification for developers and researchers, but it is incremental as it benchmarks existing LLMs without introducing a new method.

The paper tackles the challenge of automatically discovering strong loop invariants for program verification by introducing a principled framework to evaluate LLMs on invariant synthesis, showing that while LLM-based verifiers do not yet significantly outperform traditional solvers like UAutomizer, fine-tuning and sampling techniques can improve speedup cases, such as raising Qwen3-Coder-480B from 8% to 29.2% with fine-tuning.

Program verification relies on loop invariants, yet automatically discovering strong invariants remains a long-standing challenge. We introduce a principled framework for evaluating LLMs on invariant synthesis. Our approach uses a verifier-based decision procedure with a formal soundness guarantee and assesses not only correctness but also the speedup that invariants provide in verification. We evaluate 7 state-of-the-art LLMs, and existing LLM-based verifiers against the traditional solver UAutomizer. While LLM-based verifiers represent a promising direction, they do not yet offer a significant advantage over UAutomizer. Model capability also proves critical, as shown by sharp differences in speedups across models, and our benchmark remains an open challenge for current LLMs. Finally, we show that supervised fine-tuning and Best-of-N sampling can improve performance: fine-tuning on 3589 instances raises the percentage of speedup cases for Qwen3-Coder-480B from 8% to 29.2%, and Best-of-N sampling with N=16 improves Claude-sonnet-4 from 8.8% to 22.1%.

View on arXiv PDF

Similar