SE | AI | May 15, 2025

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

arXiv:2505.10443v2 | 10 citations | h-index: 6 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the reliability of LLMs in programming tasks by revealing their lack of robust, semantically grounded reasoning, which is critical for developers and AI safety.

The study evaluated whether large language models (LLMs) with up to 8B parameters can reason about Python programs or merely guess, by applying semantics-preserving code mutations; it found that LLMs produce correct predictions based on flawed reasoning in 10% to 50% of cases and often change predictions in response to mutations.

Understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies have assessed LLMs' ability to predict program outputs, most focus solely on the accuracy of those predictions, without evaluating the reasoning behind them. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this work, we evaluate whether state-of-the-art LLMs with up to 8B parameters can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while loops, and loop unrolling. These mutations preserve a program's semantics while altering its syntax. We evaluated six LLMs and performed a human expert analysis using LiveCodeBench to assess whether the correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. Our findings show that LLMs trained for code produce correct predictions based on flawed reasoning in 10% to 50% of cases. Furthermore, LLMs often change predictions in response to our code mutations, indicating they do not yet exhibit stable, semantically grounded reasoning.
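To make the five mutations concrete, here is a minimal hand-written sketch applied to a toy function. The function `original`, its variants, and the test inputs are hypothetical illustrations, not taken from LiveCodeBench or CruxEval, and the paper's own tooling may apply the mutations differently (for example, with another unrolling factor). Every variant should return the same value as `original` on any input list.

```python
# Toy program (hypothetical; not from the benchmarks used in the paper).
def original(nums):
    total = 0
    for i in range(len(nums)):
        if nums[i] > 2:
            total += nums[i]
        else:
            total -= 1
    return total

# 1. Renaming variables: identifiers change, structure does not.
def renamed(xs):
    acc = 0
    for k in range(len(xs)):
        if xs[k] > 2:
            acc += xs[k]
        else:
            acc -= 1
    return acc

# 2. Mirroring a comparison: `a > b` is rewritten as `b < a`.
def mirrored(nums):
    total = 0
    for i in range(len(nums)):
        if 2 < nums[i]:
            total += nums[i]
        else:
            total -= 1
    return total

# 3. Swapping if-else branches: negate the condition, exchange the bodies.
def swapped(nums):
    total = 0
    for i in range(len(nums)):
        if not (nums[i] > 2):
            total -= 1
        else:
            total += nums[i]
    return total

# 4. Converting the for loop to a while loop with an explicit counter.
def as_while(nums):
    total = 0
    i = 0
    while i < len(nums):
        if nums[i] > 2:
            total += nums[i]
        else:
            total -= 1
        i += 1
    return total

# 5. Loop unrolling (factor 2 assumed here), plus a tail for odd lengths.
def unrolled(nums):
    total = 0
    i = 0
    while i + 1 < len(nums):
        if nums[i] > 2:
            total += nums[i]
        else:
            total -= 1
        if nums[i + 1] > 2:
            total += nums[i + 1]
        else:
            total -= 1
        i += 2
    if i < len(nums):
        if nums[i] > 2:
            total += nums[i]
        else:
            total -= 1
    return total

VARIANTS = [renamed, mirrored, swapped, as_while, unrolled]

def all_equivalent(inputs):
    """Return True if every variant agrees with `original` on every input."""
    return all(f(xs) == original(xs) for f in VARIANTS for xs in inputs)

if __name__ == "__main__":
    print(all_equivalent([[], [1], [3, 1, 4], [5, 5, 5, 0]]))
```

Because the ground-truth output is identical for all six versions, a model that reasons about semantics should give the same prediction for each one. A prediction that flips between variants signals sensitivity to surface syntax.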

