LG CL SE ML · Jun 3, 2019

A Language-Agnostic Model for Semantic Source Code Labeling

Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe, David Slater

arXiv:1906.01032v1 · 9 citations

Originality: Incremental advance

AI Analysis

The work addresses the difficulty developers face in code search and comprehension by providing scalable, up-to-date labeling of source code. The advance is incremental, since it builds on existing deep learning methods for code analysis.

The paper tackles the problem of labeling source code at scale for improved search and comprehension by training a language-agnostic deep convolutional neural network on Stack Overflow snippets, achieving a mean AUC of 0.957 over 4,508 tags and 86.6% top-1 accuracy on GitHub documents.

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
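The two metrics quoted in the abstract, mean area under ROC across tags and top-1 accuracy, can be illustrated with a small sketch. This is not the authors' evaluation code. The function names, the toy scores, and the rule that a top-1 prediction counts as correct when the highest-scoring tag appears in a reference set are all assumptions made for illustration; the paper's top-1 figure comes from manual validation.

```python
from typing import Dict, List, Sequence, Set


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auc(scores: List[Dict[str, float]],
             truth: List[Set[str]],
             tags: List[str]) -> float:
    """Compute one AUC per tag, then average over tags (a macro average)."""
    per_tag = []
    for tag in tags:
        tag_scores = [doc[tag] for doc in scores]
        tag_labels = [1 if tag in t else 0 for t in truth]
        per_tag.append(roc_auc(tag_scores, tag_labels))
    return sum(per_tag) / len(per_tag)


def top1_accuracy(scores: List[Dict[str, float]],
                  truth: List[Set[str]]) -> float:
    """Fraction of documents whose highest-scoring tag is in the reference set."""
    hits = sum(1 for doc, t in zip(scores, truth) if max(doc, key=doc.get) in t)
    return hits / len(scores)


# Hypothetical model outputs for four documents and two tags.
toy_scores = [
    {"python": 0.9, "sql": 0.2},
    {"python": 0.4, "sql": 0.8},
    {"python": 0.7, "sql": 0.6},
    {"python": 0.3, "sql": 0.1},
]
# Hypothetical reference tags for the same four documents.
toy_truth = [{"python"}, {"sql"}, {"python", "sql"}, {"sql"}]

toy_mean_auc = mean_auc(toy_scores, toy_truth, ["python", "sql"])
toy_top1 = top1_accuracy(toy_scores, toy_truth)
```

Averaging per-tag AUCs weights every tag equally, so rare tags in a long-tailed list count as much as common ones. The pairwise loop is quadratic in the number of documents, so a rank-based implementation would be preferable at the scale the abstract describes.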
