SE AI · Feb 28

Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement

arXiv:2603.00539v17.22 citations · h-index: 2

Originality: Highly original

AI Analysis

The paper exposes critical reliability issues in LLM-based code review tools used by software engineers, a limitation of automated development pipelines that has received little attention.

The paper shows that large language models (LLMs) systematically misclassify correct code as non-compliant with natural language requirements, and that more detailed prompts increase the misjudgment rate. It proposes a Fix-guided Verification Filter that uses model-proposed fixes as counterfactual evidence to improve reliability.

Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether code implementations satisfy task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can reliably judge code against the given task descriptions, which are usually written as natural language specifications. In this paper, we uncover a systematic failure of LLMs in matching code to natural language requirements. Specifically, using widely adopted benchmarks and a unified prompt design, we demonstrate that LLMs frequently misclassify correct code implementations as non-compliant or defective. Surprisingly, we find that more detailed prompt designs, particularly those requiring explanations and proposed corrections, lead to higher misjudgment rates, highlighting critical reliability issues for LLM-based code assistants. We further analyze the mechanisms driving these failures and evaluate the reliability of rationale-required judgments. Building on these findings, we propose a Fix-guided Verification Filter that treats the model-proposed fix as executable counterfactual evidence and validates the original and revised implementations using benchmark tests and spec-constrained augmented tests. Our results expose previously under-explored limitations in LLM-based code review capabilities and provide practical guidance for integrating LLM-based reviewers with safeguards in automated review and development pipelines.
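The idea of treating a proposed fix as executable counterfactual evidence can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names (`passes`, `fix_guided_filter`), the verdict labels, and the decision rule are all assumptions made for illustration, and a single list of input/output cases stands in for the benchmark tests and spec-constrained augmented tests mentioned in the abstract.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical test-case shape: (positional args, expected return value).
Case = Tuple[tuple, object]


def passes(func: Callable, cases: Iterable[Case]) -> bool:
    """Return True only if func returns the expected value on every case."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test.
            return False
    return True


def fix_guided_filter(original: Callable, proposed_fix: Callable,
                      flagged: bool, cases: Iterable[Case]) -> str:
    """Assumed decision rule (illustrative, not taken from the paper).

    - reviewer did not flag the code          -> "accept"
    - flagged, but the original passes tests  -> "overruled" (likely overcorrection)
    - flagged, original fails, fix passes     -> "confirmed"
    - flagged, both fail                      -> "inconclusive"
    """
    cases = list(cases)
    if not flagged:
        return "accept"
    if passes(original, cases):
        return "overruled"
    if passes(proposed_fix, cases):
        return "confirmed"
    return "inconclusive"


# Toy example: a correct absolute-value function that a reviewer wrongly flags,
# together with a "fix" that actually breaks negative inputs.
def original_abs(x):
    return x if x >= 0 else -x


def fixed_abs(x):
    return x


cases = [((3,), 3), ((-3,), 3), ((0,), 0)]
print(fix_guided_filter(original_abs, fixed_abs, True, cases))
```

In this toy run the original passes every case, so the reviewer's rejection is overruled. A real pipeline would presumably execute untrusted code in a sandbox rather than calling it directly.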
