SE, LG, Oct 20, 2021

JavaBERT: Training a transformer-based model for the Java programming language

arXiv:2110.10404v1, 20 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for machine learning tools in software engineering to improve code quality. The contribution is incremental: it adapts existing NLP methods to a new domain.

The authors tackled the problem of applying natural language processing models to software code by training a transformer-based model on Java code, resulting in JavaBERT, which achieved high accuracy on masked language modeling tasks.

Code quality is and will remain a crucial factor in developing new software, requiring appropriate tools to ensure functional and reliable code. Machine learning techniques are still rarely used in software engineering tools, which therefore miss out on the potential benefits of applying them. Natural language processing has shown the potential to process text data across a variety of tasks. We argue that such models can show similar benefits for software code processing. In this paper, we investigate how models used for natural language processing can be trained on software code. We introduce a data retrieval pipeline for software code and train a model on Java software code. The resulting model, JavaBERT, achieves high accuracy on the masked language modeling task, showing its potential for software engineering tools.
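To make the masked language modeling task concrete, the following is a minimal sketch of how masked training examples could be built from a Java snippet and how accuracy on the masked positions could be scored. It is not the authors' pipeline. The regex tokenizer, the 15% masking rate (the rate used by the original BERT), and the function names `tokenize_java`, `mask_tokens`, and `mlm_accuracy` are all assumptions made for illustration. BERT's additional step of sometimes substituting a random token or keeping the original token is omitted.

```python
import random
import re

MASK = "[MASK]"


def tokenize_java(source):
    # Naive tokenizer (assumption, not the paper's tokenizer):
    # identifiers/keywords, integer literals, or single non-space symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # Replace a random subset of tokens with [MASK] and remember the originals.
    # mask_rate=0.15 follows the original BERT setup (assumption for JavaBERT).
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


def mlm_accuracy(predictions, labels):
    # Fraction of masked positions where the predicted token equals the original.
    if not labels:
        return 0.0
    correct = sum(1 for pos, tok in labels.items() if predictions.get(pos) == tok)
    return correct / len(labels)


snippet = "public int add(int a, int b) { return a + b; }"
tokens = tokenize_java(snippet)
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
```

A model trained on this task receives `masked` as input and is scored on how often it recovers the tokens stored in `labels`. Passing a model's per-position predictions to `mlm_accuracy` gives the kind of accuracy figure the abstract refers to.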

Code Implementations: 1 repo
