SE · May 4

Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation

arXiv:2605.02195
AI Analysis

For researchers evaluating LLM-based code translation, this work identifies a critical flaw in current evaluation frameworks that inflates failure rates, urging more transparent and configuration-aware standards.

The study reveals that many reported failures in LLM-based code translation are due to evaluation-induced errors (e.g., compilation flags, missing libraries) rather than incorrect logic, based on an empirical analysis of 6,164 translations across five languages and three benchmarks.

Large Language Models (LLMs) have achieved remarkable success in automated code translation. While prior work has focused on improving translation accuracy through advanced prompting and iterative repair, the reliability of the underlying evaluation frameworks has received less attention. In this paper, we demonstrate that a significant number of reported failures in code translation are due not to incorrect logic but to evaluation-induced errors stemming from improper compilation flags, missing library links, and unconfigured runtime environments. We conduct a large-scale empirical study across five programming languages (C, C++, Java, Python, Go) and three benchmarks (Avatar, CodeNet, EvalPlus), covering 6,164 translations generated by GPT-4o, DeepSeek-Coder, and Magicoder. Our analysis identifies and categorizes common false negatives, distinguishing pipeline-induced failures that affect any model from model-dependent behaviors that vary across LLMs. Our findings highlight the necessity for transparent, configuration-aware evaluation standards to accurately assess progress in LLM-based code translation.
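The distinction the abstract draws, between failures caused by the evaluation setup and failures caused by the translated logic, can be illustrated with a small triage step over build and run output. The following is a minimal sketch, not the paper's method: the function name `triage`, the `Verdict` type, the category labels, and the pattern list are all hypothetical, and the patterns are only a few well-known toolchain messages chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical, illustrative patterns; this is NOT the paper's taxonomy.
# Each pattern is a common toolchain message that points at the evaluation
# setup rather than at the logic of the translated program.
CONFIG_PATTERNS = [
    # GNU ld when a library (for example -lm) was not linked.
    (re.compile(r"undefined reference to"), "missing library link"),
    # GCC/Clang when an include path or header is absent.
    (re.compile(r"fatal error: .+: No such file or directory"),
     "missing header or include path"),
    # CPython when a package is not installed in the runtime environment.
    (re.compile(r"ModuleNotFoundError"), "unconfigured runtime environment"),
    # javac when a dependency is missing from the classpath.
    (re.compile(r"error: package \S+ does not exist"),
     "missing classpath entry"),
]


@dataclass
class Verdict:
    category: str  # "pass", "evaluation-induced", or "candidate logic failure"
    reason: str


def triage(stderr: str, tests_passed: bool) -> Verdict:
    """Label one translation result using its captured stderr."""
    if tests_passed:
        return Verdict("pass", "all tests passed")
    for pattern, reason in CONFIG_PATTERNS:
        if pattern.search(stderr):
            return Verdict("evaluation-induced", reason)
    # No configuration pattern matched; the failure still needs manual
    # review before it is counted against the model.
    return Verdict("candidate logic failure",
                   "no known configuration pattern matched")
```

A result flagged as evaluation-induced would be re-run with a corrected configuration before being scored, so that only failures that survive the re-run count as translation errors. Pattern matching of this kind is deliberately conservative: anything unmatched is only a candidate logic failure, not a confirmed one.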
