SE IR LG
Apr 17, 2022

Addressing Leakage in Self-Supervised Contextualized Code Retrieval

Johannes Villmow, Viola Campos, Adrian Ulges, Ulrich Schwanecke

arXiv:2204.11594v154.7580 citations
h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses leakage issues in self-supervised code retrieval, which is important for developers needing relevant code snippets, but it appears incremental as it builds on existing contrastive methods with specific enhancements.

The paper tackles the problem of contextualized code retrieval by developing a self-supervised contrastive training method that uses mutual identifier masking, dedentation, and syntax-aligned targets to address leakage, and introduces a new evaluation dataset. The approach improves retrieval substantially and achieves state-of-the-art results in code clone and defect detection.

We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that our approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection.
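The context/target split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "mutual identifier masking" means each identifier shared by context and target is masked on one randomly chosen side, so the two never share a name. The target is chosen by line range, so syntax-aligned target selection is not covered. The tokens `<mask>` and `<gap>` and the helper names `identifiers`, `mask_names`, and `make_pair` are hypothetical.

```python
import keyword
import random
import re
import textwrap

MASK = "<mask>"  # hypothetical placeholder for a masked identifier
GAP = "<gap>"    # hypothetical marker for the span removed from the context
IDENT_RE = re.compile(r"\b[A-Za-z_]\w*\b")


def identifiers(code):
    """Return identifier-like tokens in code, excluding Python keywords."""
    return {t for t in IDENT_RE.findall(code) if not keyword.iskeyword(t)}


def mask_names(code, names):
    """Replace every whole-token occurrence of the given names in one pass."""
    return IDENT_RE.sub(
        lambda m: MASK if m.group(0) in names else m.group(0), code
    )


def make_pair(source, start, end, rng):
    """Split source into (context, target) at lines [start, end)."""
    lines = source.splitlines()
    before = "\n".join(lines[:start])
    after = "\n".join(lines[end:])
    # Dedentation: drop the common leading whitespace so the indentation
    # depth of the target does not reveal where it was cut from.
    target = textwrap.dedent("\n".join(lines[start:end]))

    # Assumed masking rule: every identifier shared by both sides is hidden
    # on exactly one side, chosen at random.
    shared = (identifiers(before) | identifiers(after)) & identifiers(target)
    in_context = {name for name in sorted(shared) if rng.random() < 0.5}
    in_target = shared - in_context

    context = "\n".join(
        [mask_names(before, in_context), GAP, mask_names(after, in_context)]
    )
    return context, mask_names(target, in_target)


if __name__ == "__main__":
    src = "def area(w, h):\n    result = w * h\n    return result\n"
    context, target = make_pair(src, 1, 2, random.Random(0))
    print(context)
    print("---")
    print(target)
```

Which side each identifier is masked on depends on the random generator, but under this rule no identifier from the original program remains visible in both the context and the target.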
