Language Agnostic Code Embeddings
This work addresses cross-language code retrieval for developers and researchers, but its contribution is incremental because it builds on existing multilingual code models.
The paper addresses the limited understanding of multilingual code embeddings by analyzing their cross-lingual capabilities. It finds that the embeddings contain a language-specific component and a language-agnostic component, and that removing the language-specific component to isolate the language-agnostic one improves code retrieval, with an absolute gain of up to +17 in Mean Reciprocal Rank (MRR).
Recently, code language models have achieved notable advancements in addressing a diverse array of essential code comprehension and generation tasks. Yet, the field lacks a comprehensive understanding of the code embeddings of multilingual code models. In this paper, we present a comprehensive study on multilingual code embeddings, focusing on the cross-lingual capabilities of these embeddings across different programming languages. Through probing experiments, we demonstrate that code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details, primarily focusing on semantics. Further, we show that when we isolate and eliminate this language-specific component, we observe significant improvements in downstream code retrieval tasks, leading to an absolute increase of up to +17 in the Mean Reciprocal Rank (MRR).
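To make the idea of removing a language-specific component concrete, here is a minimal sketch. It assumes the simplest possible version of that component, a constant offset shared by all snippets in one language, and removes it by subtracting each language's mean embedding. This is an illustrative assumption and not necessarily the procedure used in the paper. The function names (`remove_language_mean`, `mean_reciprocal_rank`) and the toy vectors are hypothetical and synthetic; they are not taken from the paper's experiments.

```python
import math
from collections import defaultdict


def remove_language_mean(embeddings, languages):
    """Subtract each language's mean vector from that language's embeddings."""
    sums, counts = {}, defaultdict(int)
    for vec, lang in zip(embeddings, languages):
        if lang not in sums:
            sums[lang] = [0.0] * len(vec)
        sums[lang] = [s + x for s, x in zip(sums[lang], vec)]
        counts[lang] += 1
    means = {lang: [s / counts[lang] for s in total] for lang, total in sums.items()}
    return [[x - m for x, m in zip(vec, means[lang])]
            for vec, lang in zip(embeddings, languages)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_reciprocal_rank(queries, candidates):
    """MRR, assuming candidates[i] is the single correct match for queries[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        sims = [cosine(query, cand) for cand in candidates]
        # Rank of the correct candidate = 1 + number of candidates scored strictly higher.
        rank = 1 + sum(1 for s in sims if s > sims[i])
        total += 1.0 / rank
    return total / len(queries)


# Synthetic toy data (assumption): a shared "semantic" vector per snippet,
# plus a large constant offset that depends only on the language.
semantics = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
py_offset = [5.0, 0.0, 0.0]
java_offset = [0.0, 5.0, 0.0]
python_vecs = [[s + o for s, o in zip(v, py_offset)] for v in semantics]
java_vecs = [[s + o for s, o in zip(v, java_offset)] for v in semantics]

# Retrieve the matching Java snippet for each Python snippet, before and after centering.
raw_mrr = mean_reciprocal_rank(python_vecs, java_vecs)
centered = remove_language_mean(python_vecs + java_vecs, ["python"] * 4 + ["java"] * 4)
centered_mrr = mean_reciprocal_rank(centered[:4], centered[4:])

print(f"MRR before centering: {raw_mrr:.3f}")
print(f"MRR after centering:  {centered_mrr:.3f}")
```

In this toy setup the language offsets dominate cosine similarity, so cross-language retrieval on the raw vectors ranks several wrong candidates above the correct one. Once the per-language mean is subtracted, only the shared semantic part remains and matching snippets line up. Real embeddings are unlikely to separate this cleanly, so the sketch illustrates the mechanism only and says nothing about the size of the gain.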