SE · Oct 3, 2021

DeepSCC: Source Code Classification Based on Fine-Tuned RoBERTa

Guang Yang, Yanlin Zhou, Chi Yu, Xiang Chen

arXiv:2110.00914v1 · 13.316 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses a task that developers and researchers in software engineering commonly face, but its contribution is incremental: it applies an existing model (RoBERTa) to a specific domain.

The paper tackled the problem of programming language classification for code snippets from Stack Overflow by proposing DeepSCC, a method based on fine-tuned RoBERTa, and showed its competitiveness against nine state-of-the-art baselines across four performance measures.

In software engineering-related tasks (such as programming language tag prediction based on code snippets from Stack Overflow), the programming language classification for code snippets is a common task. In this study, we propose a novel method, DeepSCC, which uses a fine-tuned RoBERTa model to classify the programming language type of the source code. In our empirical study, we choose a corpus collected from Stack Overflow, which contains 224,445 pairs of code snippets and corresponding language types. After comparing against nine state-of-the-art baselines from the fields of source code classification and neural text classification in terms of four performance measures (i.e., Accuracy, Precision, Recall, and F1), we show the competitiveness of our proposed method DeepSCC.
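The four performance measures named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' evaluation code: the function name `classification_scores` is hypothetical, and macro-averaging over language labels is an assumption, since the abstract does not say how per-class scores are aggregated.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Assumption: macro-averaging (every language label weighs the same);
    the abstract does not state which averaging scheme was used.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Per-label true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1
            fn[truth] += 1

    def safe_div(num, den):
        # Undefined ratios (0/0) are scored as 0.0.
        return num / den if den else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with made-up labels (not data from the paper).
scores = classification_scores(
    ["python", "java", "python", "c"],
    ["python", "python", "python", "c"],
)
```

With an imbalanced corpus, macro-averaged scores can differ noticeably from accuracy, which is one reason to report all four measures side by side.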
