SE · AI · PL · Mar 31, 2025

Assessing Code Understanding in LLMs

arXiv:2504.00065v1 · 2 citations · h-index: 29 · Has Code · FORTE
Originality: Synthesis-oriented
AI Analysis

This paper addresses how reliably LLMs understand code, a concern for developers who use them, but its contribution is incremental because it builds on existing evaluation methods.

The paper evaluates how well Large Language Models understand code by testing whether they correctly judge semantic equivalence after program transformations. The models fail in approximately 41% of cases without context and in 29% of cases with a simple generic context. The authors propose integrating LLMs with code-optimization tools to improve accuracy.

We present an empirical evaluation of Large Language Models in code understanding associated with non-trivial, semantic-preserving program transformations such as copy propagation or constant folding. Our findings show that LLMs fail to judge semantic equivalence in approximately 41% of cases when no context is provided and in 29% when given a simple generic context. To improve accuracy, we advocate integrating LLMs with code-optimization tools to enhance training and facilitate more robust program understanding.
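To make the two transformations named in the abstract concrete, here is a minimal illustrative sketch in Python. It is not taken from the paper or its repository: the function names, the sample program, and the `agree_on` helper are all hypothetical. It shows a small function before and after copy propagation and constant folding, then checks by testing that the variants agree on a range of inputs.

```python
# Hypothetical example program; not from the paper's benchmark.
def original(x: int) -> int:
    a = 4 * 8      # constant expression, a candidate for constant folding
    b = x          # plain copy of x, a candidate for copy propagation
    c = b + a
    return c * 2


def after_copy_propagation(x: int) -> int:
    a = 4 * 8
    c = x + a      # the use of b is replaced by x, so the copy disappears
    return c * 2


def after_constant_folding(x: int) -> int:
    c = x + 32     # 4 * 8 is evaluated ahead of time and a is inlined
    return c * 2


def agree_on(inputs, *funcs) -> bool:
    """Return True if every function yields the same result on every input."""
    return all(len({f(v) for f in funcs}) == 1 for v in inputs)


SAMPLE_INPUTS = range(-50, 51)
EQUIVALENT_ON_SAMPLES = agree_on(
    SAMPLE_INPUTS, original, after_copy_propagation, after_constant_folding
)
```

Agreement on sampled inputs is only evidence of equivalence, not a proof. The question the paper puts to LLMs is whether two such variants are semantically equivalent for all inputs.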

Code Implementations: 1 repo
