LG, AI, CL, PL, SE | Feb 18, 2025

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

Stanford
arXiv:2502.12466v3 | 11 citations | h-index: 12 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the need for better evaluation of how LLMs reason about program semantics in code-related tasks. It is an incremental advance, since it builds on existing benchmarking approaches.

The authors ask whether large language models (LLMs) truly understand program semantics and introduce EquiBench, a benchmark built around equivalence checking. In the most challenging categories, the best models reach accuracies of 63.8% and 76.2%, only modestly above the 50% random baseline.

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
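To make the equivalence-checking task concrete, here is a minimal Python sketch. It is not taken from EquiBench; the function names and the bounded input range are illustrative assumptions. It shows one pair that is equivalent despite looking very different, and one pair that looks nearly identical but is not equivalent. Note that testing on a finite set of inputs can only refute equivalence, never prove it, which is why the benchmark derives its labels from program analysis, compiler scheduling, and superoptimization instead.

```python
def sum_loop(n: int) -> int:
    """Sum the integers 0..n-1 with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def sum_closed_form(n: int) -> int:
    """Same function as sum_loop, written as a closed form (syntactically dissimilar)."""
    return n * (n - 1) // 2 if n > 0 else 0


def sum_off_by_one(n: int) -> int:
    """Looks almost identical to sum_loop, but the loop also includes n itself."""
    total = 0
    for i in range(n + 1):
        total += i
    return total


def find_counterexample(f, g, inputs):
    """Return the first input on which f and g disagree, or None if none is found.

    A None result only means no difference was observed on the inputs tried;
    it is not a proof of equivalence.
    """
    for x in inputs:
        if f(x) != g(x):
            return x
    return None


if __name__ == "__main__":
    candidates = range(-5, 50)  # hypothetical bounded input range for illustration
    print(find_counterexample(sum_loop, sum_closed_form, candidates))
    print(find_counterexample(sum_loop, sum_off_by_one, candidates))
```

A model that judges equivalence by surface similarity would tend to get both pairs wrong: the first pair shares almost no syntax yet computes the same function, while the second differs by a single token yet disagrees on the input 1.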

