PLAISEMay 3, 2024

Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming

arXiv:2405.01787v328 citationsh-index: 40Has CodeICSE
Originality Incremental advance
AI Analysis

This work addresses the substantial human effort in proof-oriented programming for developers and researchers, though it is incremental as it builds on existing SMT-assisted methods with new AI techniques.

The paper tackles the challenge of automating proof-oriented programming by curating a large dataset of 600K lines of F* programs and proofs, and finds that fine-tuned smaller language models like Phi-2 perform comparably to larger models like GPT-4 at lower computational cost, with type-based retrieval augmentation significantly boosting performance.

Proof-oriented programs mix computational content with proofs of program correctness. However, the human effort involved in programming and proving is still substantial, despite the use of Satisfiability Modulo Theories (SMT) solvers to automate proofs in languages such as F*. Seeking to spur research on using AI to automate the construction of proof-oriented programs, we curate a dataset of 600K lines of open-source F* programs and proofs, including software used in production systems ranging from Windows and Linux to Python and Firefox. Our dataset includes around 32K top-level F* definitions, each representing a type-directed program and proof synthesis problem producing a definition given a formal specification expressed as an F* type. We provide a program fragment checker that queries F* to check the correctness of candidate solutions. We also report on an extended version of our dataset containing a total of 940K lines of programs and proofs, with a total of 54k top-level F* definitions. We believe this is the largest corpus of SMT-assisted program proofs coupled with a reproducible program-fragment checker. Grounded in this dataset, we investigate the use of AI to synthesize programs and their proofs in F*, with promising results. Our main finding in that the performance of fine-tuned smaller language models (such as Phi-2 or StarCoder) compare favorably with large language models (such as GPT-4), at a much lower computational cost. We also identify various type-based retrieval augmentation techniques and find that they boost performance significantly. With detailed error analysis and case studies, we identify potential strengths and weaknesses of models and techniques and suggest directions for future improvements.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes