SE, AI | Mar 28, 2025

Post-Incorporating Code Structural Knowledge into Pretrained Models via ICL for Code Translation

arXiv:2503.22776v2 | 3 citations | h-index: 4 | IEEE Trans Softw Eng

Originality: Incremental advance

AI Analysis

This paper addresses the difficulty of handling syntactic structure in code translation, a problem relevant to software developers. It offers an incremental improvement: code structural knowledge is post-incorporated into a pre-trained model without retraining.

The paper tackles the challenge of incorporating code structural knowledge into pre-trained large language models for code translation. It proposes a training-free, model-agnostic method that uses in-context learning with an information-theoretic exemplar selection approach, and its experiments report significant improvements in model performance.

Code translation migrates codebases across programming languages. Recently, large language models (LLMs) have achieved significant advancements in software mining. However, handling the syntactic structure of source code remains a challenge. Classic syntax-aware methods depend on intricate model architectures and loss functions, rendering their integration into LLM training resource-intensive. This paper employs in-context learning (ICL), which directly integrates task exemplars into the input context, to post-incorporate code structural knowledge into pre-trained LLMs. We revisit exemplar selection in ICL from an information-theoretic perspective, proposing that list-wise selection based on information coverage is a more precise and general objective than traditional methods based on combining similarity and diversity. To address the challenges of quantifying information coverage, we introduce a surrogate measure, Coverage of Abstract Syntax Tree (CAST). Furthermore, we formulate the NP-hard CAST maximization for exemplar selection and prove that it is a standard submodular maximization problem. Therefore, we propose a greedy algorithm for CAST submodular maximization, which theoretically guarantees a (1-1/e)-approximate solution in polynomial time. Our method is the first training-free and model-agnostic approach to post-incorporate code structural knowledge into existing LLMs at test time. Experimental results show that our method significantly improves LLM performance and reveal two meaningful insights: 1) Code structural knowledge can be effectively post-incorporated into pre-trained LLMs during inference, despite being overlooked during training; 2) Scaling up model size or training data does not lead to the emergence of code structural knowledge, underscoring the necessity of explicitly considering code syntactic structure.
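The greedy selection described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of CAST, so the code below substitutes a simple stand-in: the set of (parent, child) node-type pairs in a Python AST, with coverage measured as the number of the query's pairs that the chosen exemplars contain. The names `ast_features` and `greedy_select` are hypothetical, and the use of Python's `ast` module is an assumption made for illustration. Set coverage of this kind is monotone and submodular, which is the property that gives greedy selection its (1-1/e) guarantee under a cardinality limit.

```python
from __future__ import annotations

import ast


def ast_features(code: str) -> set[tuple[str, str]]:
    """Stand-in structural features (an assumption, not the paper's CAST):
    the set of (parent type, child type) pairs in the Python AST of `code`."""
    tree = ast.parse(code)
    feats: set[tuple[str, str]] = set()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            feats.add((type(parent).__name__, type(child).__name__))
    return feats


def greedy_select(query: str, candidates: list[str], k: int) -> list[int]:
    """Greedily pick up to k candidate indices whose features jointly
    cover as many of the query's features as possible."""
    target = ast_features(query)
    # Only features that also occur in the query count toward coverage.
    cand_feats = [ast_features(c) & target for c in candidates]
    covered: set[tuple[str, str]] = set()
    chosen: list[int] = []
    for _ in range(min(k, len(candidates))):
        best, best_gain = None, 0
        for i, feats in enumerate(cand_feats):
            if i in chosen:
                continue
            gain = len(feats - covered)  # marginal coverage gain
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining candidate adds coverage
            break
        chosen.append(best)
        covered |= cand_feats[best]
    return chosen


query = "for i in range(3):\n    if i > 1:\n        print(i)\n"
candidates = [
    "x = 1\n",
    "for j in range(2):\n    print(j)\n",
    "if a > b:\n    y = a\n",
    "for j in range(5):\n    print(j)\n",
]
selected = greedy_select(query, candidates, k=2)
print(selected)
```

In this toy run the second loop exemplar (index 3) is structurally redundant with the first, so the list-wise objective prefers the `if` exemplar, which covers structure the loop exemplar misses. A selector that ranks each candidate independently by similarity to the query would tend to pick both loops instead.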
