SE · Apr 12

Improving Dynamic Specification Inference with LLM-Generated Counterexamples

Agustín Balestra, Agustín Nolasco, Facundo Molina, Diego Garbervetsky, Renzo Degiovanni, Nazareno Aguirre

arXiv:2604.10761 · 14.0 · h-index: 19

Predicted impact: top 55% in SE · last 90 days · Originality: Incremental advance

AI Analysis

For software developers using contract assertions, this work reduces manual filtering effort by improving the precision of automatically inferred specifications.

The paper addresses the invalid assertions that dynamic specification inference tools such as Daikon produce when test suites lack diversity. By using LLMs to generate counterexample tests, the authors discard up to 11.68% of the invalid assertions inferred by SpecFuzzer and improve precision by up to 7% without affecting recall.

Contract assertions, such as preconditions, postconditions, and invariants, play a crucial role in software development, enabling applications such as program verification, test generation, and debugging. Despite their benefits, the adoption of contract assertions is limited, due to the difficulty of manually producing such assertions. Dynamic analysis-based approaches, such as Daikon, can aid in this task by inferring expressive assertions from execution traces. However, a fundamental weakness of these methods is their reliance on the thoroughness of the test suites used for dynamic analysis. When these test suites do not contain sufficiently diverse tests, the inferred assertions are often not generalizable, leading to a high rate of invalid candidates (false positives) that must be manually filtered out. In this paper, we explore the use of large language models (LLMs) to automatically generate tests that attempt to invalidate generated assertions. Our results show that state-of-the-art LLMs can generate effective counterexamples that help to discard up to 11.68% of invalid assertions inferred by SpecFuzzer. Moreover, when incorporating these LLM-generated counterexamples into the dynamic analysis process, we observe an improvement of up to 7% in precision of the inferred specifications, with respect to the ground-truths gathered from the evaluation benchmarks, without affecting recall.
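
The idea of filtering candidates with counterexamples can be illustrated with a small sketch. It is not the authors' tooling, and it does not use Daikon or SpecFuzzer. The method `abs_value`, the candidate postconditions, the test inputs, and the ground truth are all hypothetical, and the "LLM-generated" counterexample is simply hard-coded. The sketch shows how a single extra input that violates an over-fitted candidate raises precision while leaving recall unchanged.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def abs_value(x: int) -> int:
    """Hypothetical method under analysis."""
    return x if x >= 0 else -x


# Hypothetical candidate postconditions: name -> predicate over (input, result).
Candidate = Callable[[int, int], bool]
CANDIDATES: Dict[str, Candidate] = {
    "result >= 0": lambda x, r: r >= 0,
    "result == x": lambda x, r: r == x,  # over-fitted: true only for x >= 0
    "result >= x": lambda x, r: r >= x,
}

# Assumed ground truth for this toy method (chosen for illustration only).
GROUND_TRUTH: Set[str] = {"result >= 0", "result >= x"}


def surviving(candidates: Dict[str, Candidate], inputs: Iterable[int]) -> Set[str]:
    """Keep the candidates that hold on every observed execution."""
    runs = [(x, abs_value(x)) for x in inputs]
    return {
        name
        for name, holds in candidates.items()
        if all(holds(x, r) for x, r in runs)
    }


def precision_recall(inferred: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the inferred set against the ground truth."""
    true_positives = len(inferred & truth)
    precision = true_positives / len(inferred) if inferred else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# A test suite with too little diversity: no negative inputs.
ORIGINAL_TESTS = [0, 1, 5]
# Stand-in for a counterexample an LLM might propose (hard-coded here).
COUNTEREXAMPLES = [-3]

before = surviving(CANDIDATES, ORIGINAL_TESTS)
after = surviving(CANDIDATES, ORIGINAL_TESTS + COUNTEREXAMPLES)

print("before:", sorted(before), precision_recall(before, GROUND_TRUTH))
print("after: ", sorted(after), precision_recall(after, GROUND_TRUTH))
```

With only non-negative inputs, the invalid candidate `result == x` survives. The negative input falsifies it, so the inferred set shrinks to the valid candidates, and no valid candidate is lost.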
