Evaluating LLM-Generated ACSL Annotations for Formal Verification
This work addresses the problem of automating formal verification for software developers, but it is incremental as it builds on existing tools and datasets.
This paper tackles the challenge of generating accurate formal specifications for C programs by empirically evaluating five automated ACSL generation systems, including rule-based methods and large language models, on a dataset of 506 C programs, finding that the methods vary in annotation quality and proof stability under verification.
Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper empirically evaluates the extent to which formal-analysis tools can automatically generate and verify ACSL specifications without human or learning-based assistance. We conduct a controlled study on a recently released dataset of 506 C programs, repurposing it from interactive, developer-driven workflows to an automated evaluation setting. Five ACSL generation systems are compared: a rule-based Python script, Frama-C's RTE plugin, and three large language models--DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct. All generated specifications are verified under identical conditions using the Frama-C WP plugin powered by multiple SMT solvers, allowing a direct comparison of annotation quality, solver sensitivity, and proof stability. Our results provide new empirical evidence on the capabilities and limitations of automated ACSL generation, complementing prior survey-based work.