SE · Apr 20

Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation

arXiv:2506.03535 · 93.71 citations · h-index: 49 · Has Code
Predicted impact top 1% in SE · last 90 days · Originality: Incremental advance
AI Analysis

For developers and researchers building multilingual code generation systems, this study provides empirical insights into the challenges and factors affecting cross-lingual code knowledge transfer.

This paper investigates cross-lingual retrieval-augmented code generation (RACG) across 13 programming languages, constructing a dataset of nearly 14K instances. Key findings: cross-lingual knowledge transfer is non-trivial and unequal across language pairs; its efficacy depends on the linguistic affinity of the language pair and the diversity of the LLM's pretraining corpus; and RACG relies little on natural language information embedded in code when paired with a code-specific retriever.

Current research on large language models (LLMs) with retrieval-augmented code generation (RACG) has largely focused on single-language settings, leaving its cross-lingual effectiveness underexplored. Multilingual RACG systems are increasingly important for migrating and reusing code across programming languages (PLs), a common yet challenging task in modern software development. To systematically study cross-lingual code knowledge transfer in RACG, we construct a dataset covering 13 PLs with nearly 14K instances. Our experiments reveal three key insights: (1) Knowledge transfer in RACG across PLs is non-trivial, even with direct injection. (2) RACG exhibits unequal cross-lingual knowledge transfer, and its efficacy depends on the linguistic affinity of the PL pair and the diversity of the LLM's pretraining corpus. (3) RACG shows limited reliance on natural language information embedded in code when equipped with a code-specific retriever. These findings provide practical guidance for designing effective multilingual RACG systems. Code: https://github.com/icip-cas/Cross-Lingual-RACG
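To make the cross-lingual RACG setting concrete, here is a minimal sketch, not the authors' pipeline. It assumes a toy in-memory corpus, a bag-of-tokens cosine score standing in for a code-specific retriever, and a simple prompt layout that pastes the retrieved source-language snippet ahead of the target-language task. The names `tokenize`, `cosine`, `retrieve`, and `build_prompt`, and the corpus entries, are all hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Split text into lowercased alphabetic/underscore tokens (a crude stand-in for a code retriever's tokenizer)."""
    return re.findall(r"[a-z_]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, source_lang, k=1):
    """Return the top-k snippets written in source_lang, ranked by similarity to the query."""
    q = Counter(tokenize(query))
    candidates = [e for e in corpus if e["lang"] == source_lang]
    ranked = sorted(
        candidates,
        key=lambda e: cosine(q, Counter(tokenize(e["code"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(task, target_lang, retrieved):
    """Place retrieved snippets before the task (an assumed prompt layout)."""
    parts = [f"# Reference ({e['lang']}):\n{e['code']}" for e in retrieved]
    parts.append(f"# Task: {task}\n# Write the solution in {target_lang}.")
    return "\n\n".join(parts)


# Hypothetical toy corpus; a real system would index many snippets per language.
corpus = [
    {"lang": "java", "code": "int sumList(List<Integer> xs) { int s = 0; for (int x : xs) s += x; return s; }"},
    {"lang": "java", "code": "String reverseString(String s) { return new StringBuilder(s).reverse().toString(); }"},
    {"lang": "go", "code": "func sumList(xs []int) int { s := 0; for _, x := range xs { s += x }; return s }"},
]

task = "reverse a string"
hits = retrieve(task, corpus, source_lang="java")   # knowledge comes from Java
prompt = build_prompt(task, "python", hits)         # generation targets Python
print(prompt)
```

The resulting prompt would then be passed to an LLM. Varying `source_lang` against a fixed target language is one simple way to probe how much a given language pair helps, under the assumptions stated above.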
