Design, Implementation and Evaluation of a Novel Programming Language Topic Classification Workflow
This work provides a reusable pipeline for researchers and practitioners in code analysis and data-driven software engineering, though the contribution is incremental because it builds on existing methods such as SVMs.
The paper tackles the problem of classifying programming language topics in source code to aid technical decisions and tooling. Using a multi-label SVM with a sliding window and voting strategy, trained on the IBM Project CodeNet dataset, it achieves an average F1 score of 0.90 across topics and 0.75 in code-topic highlight.
As software systems grow in scale and complexity, understanding the distribution of programming language topics within source code becomes increasingly important for guiding technical decisions, improving onboarding, and informing tooling and education. This paper presents the design, implementation, and evaluation of a novel programming language topic classification workflow. Our approach combines a multi-label Support Vector Machine (SVM) with a sliding window and voting strategy to enable fine-grained localization of core language concepts such as operator overloading, virtual functions, inheritance, and templates. Trained on the IBM Project CodeNet dataset, our model achieves an average F1 score of 0.90 across topics and 0.75 in code-topic highlight. Our findings contribute empirical insights and a reusable pipeline for researchers and practitioners interested in code analysis and data-driven software engineering.
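To make the sliding window and voting idea concrete, the following is a minimal sketch, not the implementation evaluated in the paper. The abstract does not state the window size, the step, or the voting rule, so `size`, `min_share`, and the per-line share-of-windows vote are assumptions chosen for illustration. The classifier is passed in as a callable; `keyword_classifier` is a hypothetical stand-in that takes the place of the trained multi-label SVM so the example runs with the standard library alone.

```python
from collections import Counter
from typing import Callable, Iterator, List, Set, Tuple

# Hypothetical stand-in for the trained multi-label SVM: maps a keyword to a topic.
KEYWORDS = {
    "template": "templates",
    "virtual": "virtual functions",
    "operator": "operator overloading",
}


def keyword_classifier(window: List[str]) -> Set[str]:
    """Return every topic whose keyword appears in the window (illustrative only)."""
    text = " ".join(window)
    return {topic for keyword, topic in KEYWORDS.items() if keyword in text}


def sliding_windows(lines: List[str], size: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_index, window) pairs, moving one line at a time."""
    last_start = max(len(lines) - size, 0)
    for start in range(last_start + 1):
        yield start, lines[start:start + size]


def highlight_topics(
    lines: List[str],
    predict_window: Callable[[List[str]], Set[str]],
    size: int = 2,          # assumed window size, not taken from the paper
    min_share: float = 0.5,  # assumed voting threshold, not taken from the paper
) -> List[Set[str]]:
    """Assign topics to each line by voting over the windows that cover it."""
    votes = [Counter() for _ in lines]
    covering = [0] * len(lines)
    for start, window in sliding_windows(lines, size):
        labels = predict_window(window)
        for i in range(start, start + len(window)):
            covering[i] += 1          # one more window covers line i
            votes[i].update(labels)   # each predicted topic gets one vote
    # Keep a topic for a line if enough of its covering windows voted for it.
    return [
        {topic for topic, n in votes[i].items() if n / covering[i] >= min_share}
        for i in range(len(lines))
    ]


code = [
    "template <typename T>",
    "T add(T a, T b) { return a + b; }",
    "struct Shape {",
    "  virtual double area() const = 0;",
    "};",
]

for line, topics in zip(code, highlight_topics(code, keyword_classifier)):
    print(f"{sorted(topics)!s:25} {line}")
```

Because each line is covered by several overlapping windows, a single window's misprediction can be outvoted, which is what allows file-level multi-label predictions to be turned into line-level highlights. Raising `min_share` makes the highlights more conservative at the boundaries between topics.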