CL SE Jan 26, 2022

CodeRetriever: Unimodal and Bimodal Contrastive Learning for Code Search

Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, Nan Duan

arXiv:2201.10866v34.644 citations Has Code

Originality: Incremental advance

AI Analysis

CodeRetriever addresses the challenge of efficiently retrieving relevant code snippets for developers and represents a strong incremental advance in code search.

The paper tackles the problem of code search by proposing CodeRetriever, a model that learns code semantic representations through large-scale contrastive pre-training, achieving new state-of-the-art results with significant improvements on eleven code search tasks across six programming languages.

In this paper, we propose the CodeRetriever model, which learns the function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal contrastive learning and bimodal contrastive learning. For unimodal contrastive learning, we design an unsupervised learning approach to build semantic-related code pairs based on the documentation and function name. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build code-text pairs. Both contrastive objectives can fully leverage large-scale code corpus for pre-training. Extensive experimental results show that CodeRetriever achieves new state-of-the-art with significant improvement over existing code pre-trained models, on eleven domain/language-specific code search tasks with six programming languages in different code granularity (function-level, snippet-level and statement-level). These results demonstrate the effectiveness and robustness of CodeRetriever.
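As a rough illustration of the two contrastive schemes named in the abstract, the sketch below computes a generic in-batch InfoNCE loss over paired embeddings. The abstract does not state the exact objective, so the loss form, the cosine similarity, the temperature value, and all names (`info_nce_loss`, `text_vecs`, `code_vecs`, `related_code_vecs`) are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, keys, temperature=0.05):
    """Mean in-batch InfoNCE loss (assumed form, not taken from the paper).

    queries[i] and keys[i] form a positive pair; every other key in the
    batch serves as a negative for queries[i].
    """
    assert queries and len(queries) == len(keys)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)


# Toy embeddings standing in for encoder outputs (hypothetical values).
text_vecs = [[1.0, 0.0], [0.0, 1.0]]          # documentation or comments
code_vecs = [[0.9, 0.1], [0.1, 0.9]]          # the functions they describe
related_code_vecs = [[0.8, 0.2], [0.2, 0.8]]  # semantically related functions

bimodal_loss = info_nce_loss(text_vecs, code_vecs)           # text-code pairs
unimodal_loss = info_nce_loss(code_vecs, related_code_vecs)  # code-code pairs
print(bimodal_loss, unimodal_loss)
```

The same function serves both schemes here: only the source of the positive pairs changes, code-text pairs for the bimodal case and code-code pairs for the unimodal case.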
