Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
This work addresses the challenge of reliable and verifiable code generation with LLMs for formal verification, though it is incremental as it builds on existing benchmarks by focusing on compositional aspects.
The authors tackled the problem of evaluating large language models (LLMs) on compositional formal verification by introducing DafnyCOMP, a benchmark with 300 multi-function programs, and found that while LLMs perform well on single-function tasks, their performance drops sharply on compositional tasks due to systematic failures in cross-functional reasoning.
We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.