Disproving Program Equivalence with LLMs
This work addresses unreliable code evaluation, a problem for researchers and practitioners who use LLMs, though it is an incremental improvement over existing methods.
The paper tackles the problem of inadequate unit tests for evaluating LLM-generated code by introducing ProbeGen, a whitebox method that searches for counterexamples to program equivalence. It shows that LLMs with execution feedback can disprove 18% of samples considered equivalent by benchmark tests and improve pass@1 by 10% through semantic clustering.
To evaluate large language models (LLMs) for code, researchers have relied on benchmarks built from manually written unit tests. However, these tests are often inadequate: they miss corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code, and we demonstrate that LLMs with execution feedback perform well at this task. On a common code synthesis benchmark, ProbeGen disproves 18% of the samples that the benchmark-provided unit tests consider equivalent to the ground truth. Additionally, ProbeGen lets us cluster LLM samples by their semantics for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.
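The two ideas in the abstract, searching for an input on which two programs disagree and grouping samples by observed behavior, can be illustrated with a minimal Python sketch. This is not ProbeGen itself: here the probes are a fixed hand-written list, whereas the paper has an LLM propose them with execution feedback. All names (`run`, `find_counterexample`, `cluster_by_behavior`, `reference_max`, `sorted_max`, `buggy_max`) are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

Program = Callable[[Any], Any]
Outcome = Tuple[str, str]


def run(program: Program, probe: Any) -> Outcome:
    """Execute a program on one probe and summarize the observable outcome."""
    try:
        return ("ok", repr(program(probe)))
    except Exception as exc:  # an exception is also observable behavior
        return ("error", type(exc).__name__)


def find_counterexample(p: Program, q: Program, probes: Sequence[Any]) -> Optional[Any]:
    """Return the first probe on which p and q disagree, or None if none is found."""
    for probe in probes:
        if run(p, probe) != run(q, probe):
            return probe
    return None


def cluster_by_behavior(programs: Sequence[Program], probes: Sequence[Any]) -> List[List[Program]]:
    """Group programs whose outcomes match on every probe, largest group first."""
    clusters: Dict[Tuple[Outcome, ...], List[Program]] = {}
    for program in programs:
        signature = tuple(run(program, probe) for probe in probes)
        clusters.setdefault(signature, []).append(program)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical samples for the task "return the largest element of a list".
def reference_max(xs: List[int]) -> int:
    return max(xs)


def sorted_max(xs: List[int]) -> int:
    return sorted(xs)[-1]  # syntactically different, same behavior on non-empty lists


def buggy_max(xs: List[int]) -> int:
    best = 0  # bug: wrong whenever every element is negative
    for x in xs:
        if x > best:
            best = x
    return best


weak_tests = [[1, 2, 3], [5]]       # stand-in for benchmark-provided tests
probes = weak_tests + [[-3, -1]]    # one extra probe that exposes the bug

missed = find_counterexample(reference_max, buggy_max, weak_tests)
found = find_counterexample(reference_max, buggy_max, probes)
clusters = cluster_by_behavior([reference_max, sorted_max, buggy_max], probes)
```

In this sketch the weak tests do not separate `buggy_max` from the reference, while the extra probe `[-3, -1]` does. Clustering on the full probe set puts `reference_max` and `sorted_max` in one group and `buggy_max` in another, which is the kind of grouping a majority vote over semantically clustered samples could build on. Equivalence here only means "no disagreement found on these probes"; a sketch like this cannot prove equivalence.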