SE AI PL
Aug 31, 2025

Towards Repository-Level Program Verification with Large Language Models

arXiv:2509.25197v18.03 citations
h-index: 3
Has Code
Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages

Originality: Highly original

AI Analysis

This work addresses the problem of verifying complex, multi-module software projects, a concern for both developers and researchers, and represents an incremental advance over existing function-level methods.

The paper tackles the challenge of scaling automated formal verification to entire software repositories by introducing RVBench, a new benchmark for repository-level evaluation, and RagVerus, a framework that combines retrieval-augmented generation with context-aware prompting. RagVerus triples proof pass rates on existing benchmarks and achieves a 27% relative improvement on RVBench.

Recent advances in large language models (LLMs) show great promise for code and proof generation. However, scaling automated formal verification to real-world projects requires resolving cross-module dependencies and global context, challenges that existing LLM-based methods overlook because they focus on isolated, function-level verification tasks. To systematically explore and address the challenges of verifying entire software repositories, we introduce RVBench, the first verification benchmark explicitly designed for repository-level evaluation, constructed from four diverse and complex open-source Verus projects. We further introduce RagVerus, an extensible framework that combines retrieval-augmented generation with context-aware prompting to automate proof synthesis for multi-module repositories. RagVerus triples proof pass rates on existing benchmarks under constrained model inference budgets and achieves a 27% relative improvement on the more challenging RVBench, demonstrating a scalable and sample-efficient verification solution.
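
To make the idea of retrieval-augmented, context-aware prompting concrete, here is a minimal sketch. It is not RagVerus itself: the abstract does not specify the retriever or the prompt layout, so the token-overlap (Jaccard) ranking, the function names `tokens`, `retrieve`, and `build_prompt`, and the Verus-flavored snippets in `corpus` are all illustrative assumptions, and the snippets have not been checked with Verus.

```python
import re

def tokens(text):
    """Return the set of lowercased identifier-like tokens in a snippet."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))

def retrieve(target, corpus, k=2):
    """Rank repository items by Jaccard token overlap with the target
    and return the names of the top k. (Stand-in for a real retriever.)"""
    t = tokens(target)

    def score(item):
        c = tokens(item[1])
        union = t | c
        return len(t & c) / len(union) if union else 0.0

    ranked = sorted(corpus.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]

def build_prompt(target, corpus, k=2):
    """Prepend the retrieved cross-module items to the function under proof."""
    names = retrieve(target, corpus, k)
    context = "\n\n".join(f"// from {n}\n{corpus[n]}" for n in names)
    return f"{context}\n\n// Function to verify:\n{target}"

# Hypothetical repository items, keyed by a made-up module path.
corpus = {
    "math::lemma_sum_nonneg":
        "proof fn lemma_sum_nonneg(s: Seq<int>) ensures spec_sum(s) >= 0 {}",
    "vec_utils::spec_sum":
        "spec fn spec_sum(s: Seq<int>) -> int;",
    "io::log_line":
        "fn log_line(msg: &str) { }",
}

# Hypothetical target whose proof depends on items from other modules.
target = "fn sum_vec(v: &Vec<int>) -> (r: int) ensures r == spec_sum(v@) { }"

prompt = build_prompt(target, corpus, k=2)
print(prompt)
```

The point of the sketch is only the shape of the pipeline: the prompt for one function is assembled from specifications and lemmas defined elsewhere in the repository, which a function-level setup would not see. A real system would likely use a stronger retriever and pass the prompt to an LLM, then check the generated proof with the verifier.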
