PLAISE · Feb 22, 2025

Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

arXiv:2503.04779v4 · 23 citations · h-index: 12 · ACL
Originality: Synthesis-oriented
AI Analysis

The paper addresses the problem of assessing LLMs' semantic reasoning on programming tasks. The contribution is incremental, as it builds on existing evaluation benchmarks.

The paper introduces FormalBench, a benchmark that evaluates LLMs' reasoning about program semantics through formal specification inference. It finds that LLMs perform well on simple control flows but struggle with complex structures such as loops, and that self-repair prompts improve success rates by 25%.

Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.
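To make the task concrete, the sketch below shows what a specification for a small loop program can look like, and why consistency and completeness are separate requirements. It is an illustration only and not taken from the paper: the specification is written as Python assertions rather than in a formal specification language, the names `sum_to`, `buggy_sum_to`, `weak_post`, `strong_post`, and `holds_on` are hypothetical, and bounded testing stands in for a real verifier. It also assumes one common reading of the two terms: a specification is consistent if the program satisfies it, and complete if it is strong enough to reject incorrect variants of the program.

```python
def sum_to(n: int) -> int:
    """Return 0 + 1 + ... + n, with the specification as runtime assertions."""
    assert n >= 0  # precondition
    total, i = 0, 0
    while i <= n:
        # Loop invariant: total is the sum of 0..i-1, and i stays in range.
        assert 0 <= i <= n + 1 and total == i * (i - 1) // 2
        total += i
        i += 1
    # Postcondition: follows from the invariant plus the negated loop guard.
    assert total == n * (n + 1) // 2
    return total


def buggy_sum_to(n: int) -> int:
    """Hypothetical off-by-one variant: sums 0..n-1 instead of 0..n."""
    return sum(range(n))


def weak_post(n: int, result: int) -> bool:
    """Consistent with sum_to, but too weak to pin down its behaviour."""
    return result >= 0


def strong_post(n: int, result: int) -> bool:
    """Consistent with sum_to and strong enough to reject the buggy variant."""
    return result == n * (n + 1) // 2


def holds_on(post, impl, inputs) -> bool:
    """Bounded check (not a proof): does post hold for impl on every input?"""
    return all(post(n, impl(n)) for n in inputs)


if __name__ == "__main__":
    inputs = range(20)
    print(holds_on(weak_post, sum_to, inputs))          # weak spec accepts the correct program
    print(holds_on(weak_post, buggy_sum_to, inputs))    # ...and also the buggy one
    print(holds_on(strong_post, sum_to, inputs))        # strong spec accepts the correct program
    print(holds_on(strong_post, buggy_sum_to, inputs))  # ...and rejects the buggy one
```

The loop invariant is the part that requires reasoning over every iteration rather than a single execution path, which is in line with the abstract's observation that loops are where LLMs struggle most.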

