Efficient Code Embeddings from Code Generation Models
This work addresses the need for efficient code retrieval and similarity analysis across programming languages. Its contribution appears incremental, since it builds on existing autoregressive models.
The paper tackles two problems, retrieving code from natural language queries and identifying similar code snippets, by introducing jina-code-embeddings, a suite of code embedding models that achieve state-of-the-art performance despite their small size.
jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
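To make last-token pooling concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (last_token_pool, l2_normalize), the toy hidden states, and the attention masks are all hypothetical, and a real model would produce the per-token hidden states from its autoregressive backbone. The sketch assumes each sequence arrives with a 0/1 attention mask marking padding, takes the hidden state of the last real token as the embedding, and compares embeddings by cosine similarity.

```python
import math
from typing import List, Sequence

Vector = List[float]


def last_token_pool(hidden_states: Sequence[Sequence[float]],
                    attention_mask: Sequence[int]) -> Vector:
    """Return the hidden state of the last non-padding token.

    hidden_states: one vector per token position (hypothetical model output).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    # Collect the positions of real tokens; taking the last one handles
    # both left-padded and right-padded sequences.
    positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not positions:
        raise ValueError("sequence contains no real tokens")
    return list(hidden_states[positions[-1]])


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


# Toy data (assumed for illustration): a right-padded query and an unpadded snippet.
query_states = [[0.1, 0.2], [0.3, 0.4], [3.0, 4.0], [9.9, 9.9]]
query_mask = [1, 1, 1, 0]  # the final position is padding and must be ignored
code_states = [[0.0, 0.0], [6.0, 8.0]]
code_mask = [1, 1]

query_emb = l2_normalize(last_token_pool(query_states, query_mask))
code_emb = l2_normalize(last_token_pool(code_states, code_mask))

# Cosine similarity of unit vectors is their dot product.
score = sum(q * c for q, c in zip(query_emb, code_emb))
```

In an autoregressive model with causal attention, only the final token has attended to the whole input, which is the usual motivation for pooling from that position rather than averaging over all tokens.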