LG PL SEMar 16, 2025

Understanding Formal Reasoning Failures in LLMs as Abstract Interpreters

Jacqueline L. Mitchell, Brian Hyeongseok Kim, Chenyu Zhou, Chao Wang

arXiv:2503.12686v21 citationsh-index: 2Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages

Originality Incremental advance

AI Analysis

This work addresses the challenge of improving LLMs' reasoning capabilities for program verification, which is incremental as it builds on existing methods to analyze failures.

The paper tackled the problem of understanding how large language models (LLMs) reason about program semantics during program verification, by introducing novel prompting strategies for abstract interpretation-based invariant generation and evaluating them on 22 programs from the SV-COMP benchmark suite, analyzing the soundness of generated invariants and patterns in reasoning errors.

Large language models (LLMs) are increasingly used for program verification, and yet little is known about \emph{how} they reason about program semantics during this process. In this work, we focus on abstract interpretation based-reasoning for invariant generation and introduce two novel prompting strategies that aim to elicit such reasoning from LLMs. We evaluate these strategies across several state-of-the-art LLMs on 22 programs from the SV-COMP benchmark suite widely used in software verification. We analyze both the soundness of the generated invariants and the key thematic patterns in the models' reasoning errors. This work aims to highlight new research opportunities at the intersection of LLMs and program verification for applying LLMs to verification tasks and advancing their reasoning capabilities in this application.

View on arXiv PDF

Similar