LG PL

Mar 4, 2017

Machine Learning Based Source Code Classification Using Syntax Oriented Features

arXiv:1703.07638v1

8 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for automated, accurate programming language identification in software development, replacing manual labeling and assignment based on file extension.

The paper tackles the problem of automatically identifying the programming language of source code, reporting 99% accuracy on 29 popular languages with a Maximum Entropy classifier.

As of today, the programming language of the vast majority of published source code is either manually specified or programmatically assigned based solely on the file extension. In this paper we show that the source code programming language identification task can be fully automated using machine learning techniques. We first define the criteria that a production-level automatic programming language identification solution should meet. Our criteria include accuracy, programming language coverage, extensibility and performance. We then describe our approach: how training files are preprocessed to extract features that mimic grammar productions, and how these extracted grammar productions are then used for training and testing our classifier. We achieve a 99 percent accuracy rate while classifying 29 of the most popular programming languages with a Maximum Entropy classifier.
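
To make the pipeline in the abstract concrete, here is a minimal sketch of a Maximum Entropy classifier (multinomial logistic regression) trained on syntax-like features. It is not the authors' implementation. Several things are assumptions made for illustration: token unigrams and bigrams stand in for the paper's grammar-production features, the tokenizer regex and the `extract_features` and `MaxEntClassifier` names are hypothetical, and the six training snippets covering three languages are toy data, not the paper's corpus of 29 languages.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: identifiers, integer literals, or single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def extract_features(source):
    """Count token unigrams and bigrams (a rough stand-in for grammar productions)."""
    tokens = ["<num>" if t[0].isdigit() else t for t in TOKEN_RE.findall(source)]
    feats = Counter(tokens)
    feats.update(a + " " + b for a, b in zip(tokens, tokens[1:]))
    return feats


def softmax(scores):
    """Turn per-label scores into probabilities, shifting by the max for stability."""
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


class MaxEntClassifier:
    """Multinomial logistic regression trained with plain stochastic gradient ascent."""

    def __init__(self, labels):
        self.labels = list(labels)
        # One sparse weight vector per label; a missing feature reads as 0.
        self.weights = {label: Counter() for label in self.labels}

    def predict_proba(self, feats):
        scores = {
            label: sum(w[f] * v for f, v in feats.items())
            for label, w in self.weights.items()
        }
        return softmax(scores)

    def fit(self, samples, epochs=300, lr=0.02):
        data = [(extract_features(src), label) for src, label in samples]
        for _ in range(epochs):
            for feats, gold in data:
                probs = self.predict_proba(feats)
                for label in self.labels:
                    # Gradient of the log-likelihood: observed minus expected.
                    grad = (1.0 if label == gold else 0.0) - probs[label]
                    w = self.weights[label]
                    for f, v in feats.items():
                        w[f] += lr * grad * v
        return self

    def predict(self, source):
        probs = self.predict_proba(extract_features(source))
        return max(probs, key=probs.get)


# Toy training set (hypothetical snippets, far smaller than any real corpus).
TRAIN = [
    ("def add(a, b):\n    return a + b\n", "Python"),
    ("for x in items:\n    print(x)\n", "Python"),
    ("int add(int a, int b) { return a + b; }", "C"),
    ('#include <stdio.h>\nint main(void) { printf("hi"); return 0; }', "C"),
    ("function add(a, b) { return a + b; }", "JavaScript"),
    ("const xs = items.map(x => x * 2);", "JavaScript"),
]

clf = MaxEntClassifier(["Python", "C", "JavaScript"]).fit(TRAIN)

# Classify a snippet that was not in the training set.
print(clf.predict("def square(n):\n    return n * n\n"))
```

In this sketch, supporting a new language only requires adding labeled snippets and a label, which loosely mirrors the extensibility criterion in the abstract. A production system would need a much larger corpus, regularization, and held-out evaluation before any accuracy figure could be claimed.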
