80.4SEMar 19Code
TRACE: Evaluating Execution Efficiency of LLM-Based Code TranslationZhihao Gong, Zeyu Sun, Dong Huang et al.
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of \textit{execution efficiency} remains overlooked. We present \textbf{\textsc{trace}}, the first benchmark to explicitly assess efficiency in LLM-translated code. \textsc{trace} includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using \textsc{trace}, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader \textit{Claude-4-think} achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as \textit{Qwen2.5-Coder-14B-Instruct}. 2) Inefficiency is both prevalent and patterned: 23.5\% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9\%), language construct mismatches (66.4\%), and resource mismanagement (21.7\%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position \textsc{trace} as a principled foundation for efficiency-oriented evaluation.
85.5SEMar 17Code
TRACE: Evaluating Execution Efficiency of LLM-Based Code TranslationZhihao Gong, Zeyu Sun, Dong Huang et al.
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of \textit{execution efficiency} remains overlooked. We present \textbf{\textsc{trace}}, the first benchmark to explicitly assess efficiency in LLM-translated code. \textsc{trace} includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using \textsc{trace}, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader \textit{Claude-4-think} achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as \textit{Qwen2.5-Coder-14B-Instruct}. 2) Inefficiency is both prevalent and patterned: 23.5\% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9\%), language construct mismatches (66.4\%), and resource mismanagement (21.7\%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position \textsc{trace} as a principled foundation for efficiency-oriented evaluation.
76.9SEMay 18
Contextualized Code Pretraining for Code GenerationChen Liu, Qingyuan Liang, Hanwen Zhang et al.
As code generation becomes increasingly central to improving software development efficiency, modern code models are largely trained and evaluated on code with natural-language descriptions. In real projects, developers often implement missing functions under limited project-specific artifacts, while the local call-site context is already available in the surrounding code. This usage context provides actionable cues about expected behavior, but existing models are not explicitly optimized to leverage it reliably, leading to implementations that may not integrate smoothly with surrounding usage in repository settings. In this work, we propose contextualized code pretraining, an invocation-aware framework that integrates calling context into both the training and evaluation of code models. Using static analysis, we automatically extract large-scale caller-callee pairs from real repositories to construct pretraining tasks and benchmarks that condition generation on the calling context. We train CallerGen, the first code models pretrained with invocation-aware objectives spanning multiple sizes, and evaluate them on CallerEval, a new benchmark featuring realistic scenarios. Experiments show that CallerGen outperforms comparable-scale models and remains competitive with larger ones across two benchmarks. Our 220M and 0.5B models achieve 16.58% and 22.81@% pass1, surpassing baselines on CallerEval. These results highlight the importance of calling context in realistic code generation.
67.7SEMar 30
Toward Functional and Non-Functional Evaluation of Application-Level Code GenerationRuwei Pan, Yakun Zhang, Qingyuan Liang et al.
Large language models (LLMs) have achieved strong performance on code generation. However, most prior evaluations focus on snippet-level outputs, such as function generation or repository completion. These settings do not fully evaluate application-level code generation, where the goal is to produce a runnable repository with coherent multi-file structure, dependency support, and end-to-end executability. In addition, real-world software quality depends not only on functional correctness but also on non-functional quality attributes, such as maintainability and security. In this paper, we present RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, RAL-Bench derives a concise natural-language requirement from a high-quality reference project, constructs black-box system tests for both functional correctness and non-functional quality attributes. It also retains only the candidate tests that pass on the reference repository. Under this unified evaluation protocol, functional correctness is measured by the system test pass rate, while non-functional quality is evaluated along five ISO/IEC 25010-inspired dimensions, with per-dimension diagnostics and reference-normalized scoring.We evaluate 16 frontier LLMs under a controlled zero-shot setting with greedy decoding. The results show that functional correctness remains the primary bottleneck in application-level code generation, while non-functional quality also remains challenging. Under our evaluation protocol, no model exceeds a 45\% functional score. These findings suggest that strong performance on existing code generation benchmarks does not yet translate to strong performance on application-level repository generation. This result highlights the need for evaluation settings that directly assess end-to-end repository generation rather than relying only on snippet-level success.
PLMar 7, 2025
Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?Qingyuan Liang, Zhao Zhang, Zeyu Sun et al. · pku
Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval (+) and MBPP (+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.
SEJun 3, 2025
Rethinking the effects of data contamination in Code IntelligenceZhen Yang, Hongyi Lin, Yifan He et al.
In recent years, code intelligence has gained increasing importance in the field of automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns regarding data contamination and its potential impact on model performance evaluation. This paper presents a systematic empirical study to investigate the fine-grained data contamination on code intelligence tasks. Our study involves diverse representative PLMs, namely RoBERTa and GPT-2, and LLMs, namely LLaMA and StarCoder, covering three major tasks: code translation, code generation, and code summarization. We categorize contamination scenarios into four types according to the code intelligence practice, namely input-only, output-only, unpaired, and paired contamination settings, and construct corresponding experimental and control groups for exploration. Experimental results show that, under the pre-training, fine-tuning, and inference paradigm adopted by PLMs, even deliberately injecting paired contamination does not lead to significant performance overestimation. But direct inference or small-scale fine-tuning uncovers the contamination effects. In contrast, LLMs with pre-training and inference paradigm are significantly affected by the paired contamination. Apart from the above, other contamination scenarios have no impact on both PLMs and LLMs. Our findings challenge the conventional belief that contamination inevitably leads to performance overestimation, providing new insights into the evaluation and deployment of code intelligence models.
SEAug 27, 2021
Lyra: A Benchmark for Turducken-Style Code GenerationQingyuan Liang, Zeyu Sun, Qihao Zhu et al.
Recently, neural techniques have been used to generate source code automatically. While promising for declarative languages, these approaches achieve much poorer performance on datasets for imperative languages. Since a declarative language is typically embedded in an imperative language (i.e., the turducken-style programming) in real-world software development, the promising results on declarative languages can hardly lead to significant reduction of manual software development efforts. In this paper, we define a new code generation task: given a natural language comment, this task aims to generate a program in a base imperative language with an embedded declarative language. To our knowledge, this is the first turducken-style code generation task. For this task, we present Lyra: a dataset in Python with embedded SQL. This dataset contains 2,000 carefully annotated database manipulation programs from real-world projects. Each program is paired with both a Chinese comment and an English comment. In our experiment, we adopted Transformer, BERT-style, and GPT-style models as baselines. In the best setting, the generation performance of GPT-style models is better than others, where the AST exact matching accuracy is 24% and 25.5% when using Chinese and English comments, respectively. Therefore, we believe that Lyra provides a new challenge for code generation. Yet, overcoming this challenge may significantly boost the applicability of code generation techniques for real-world software development.