SE · IR · LG · Apr 17, 2022

Addressing Leakage in Self-Supervised Contextualized Code Retrieval

arXiv:2204.11594v1580 citations · h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses leakage in self-supervised code retrieval, which matters to developers who need relevant code snippets. It appears incremental, however, because it builds on existing contrastive methods and adds targeted enhancements.

The paper tackles the problem of contextualized code retrieval by developing a self-supervised contrastive training method that uses mutual identifier masking, dedentation, and syntax-aligned targets to address leakage, and introduces a new evaluation dataset. The approach improves retrieval substantially and achieves state-of-the-art results in code clone and defect detection.

We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that our approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection.
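The context/target split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `make_pair`, `identifiers`, `MASK`, and `GAP` are hypothetical, identifiers are found with a crude regular expression rather than a parser, and the choice to mask shared identifiers on the target side only is an assumption. Syntax-aligned target selection is not shown; the caller is assumed to pass a line range that already covers a complete syntactic unit.

```python
import keyword
import re
import textwrap

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
MASK = "<MASK>"  # hypothetical placeholder for a masked identifier
GAP = "<GAP>"    # hypothetical marker for where the target was removed


def identifiers(code):
    """Return the set of non-keyword identifier-like tokens in a code string."""
    return {tok for tok in IDENT.findall(code) if not keyword.iskeyword(tok)}


def make_pair(source, start, end):
    """Split lines [start, end) off as the target; the remainder is the context."""
    lines = source.splitlines()
    # Dedent the target so its indentation depth does not hint at its origin.
    target = textwrap.dedent("\n".join(lines[start:end]))
    context = "\n".join(lines[:start] + [GAP] + lines[end:])
    # Assumption: identifiers shared by both sides are masked in the target only.
    shared = identifiers(context) & identifiers(target)
    masked = IDENT.sub(
        lambda m: MASK if m.group() in shared else m.group(), target
    )
    return context, masked
```

Without the masking step, a retrieval model could match context and target simply because both mention the same variable names, which is the kind of shortcut the abstract calls leakage. The resulting pairs would then serve as positives in a contrastive objective, with targets from other programs acting as negatives.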

